Abstract

This thesis addresses the problem of detecting a specific type of nondeterminism in
shared memory parallel programs known as access anomalies. An access anomaly occurs
when an update to a shared variable X is concurrent with either a read of X or another
update of X.

The first part of the work considers dynamic detection of access anomalies. We introduce
a new technique called task recycling that detects access anomalies "on the fly" by
monitoring the program execution. This technique is designed with two goals in mind.
The first goal is minimal monitoring overhead. Costs are incurred only at thread create,
terminate, and coordinate operations and every time a monitored variable is accessed.
Because variable accesses are generally the most frequent operation, the task recycling
technique reduces the overhead per variable access to a small constant. The second goal
is generality. The task recycling technique is applicable to a wide variety of parallel
constructs and all common synchronous and asynchronous coordination primitives. Combined
with a protocol for specifying ordering constraints, the method of representing
concurrency relationships in task recycling can be extended to detect general race
conditions in parallel programs.

The second part of the thesis involves static detection of several types of nondeterminism
that make dynamic anomaly detection inefficient. In particular, the notion of
nondeterminism arising from critical section coordination is refined by distinguishing between
three types of nondeterminism: parallel, sequential, and reference nondeterminism. The
presence of these types of nondeterminism in a program impacts access anomaly detection
in two significant ways: (i) how critical section coordination is modeled during anomaly
detection, and (ii) the confidence level and complexity of guaranteeing that a program
has no access anomalies. In particular, it is shown that access anomalies can be detected
efficiently only if a program is parallel, sequential and reference deterministic. Heuristics
are presented that make access anomaly detection tractable in the presence of other
nondeterminism through a better classification and semantic understanding of a coordination
protocol.
Chapter 1


Introduction 


The growth in popularity of multiprocessors in recent years is indicative of an increasing
focus on achieving program speed-ups through parallelism. In order to decrease execution
time, a program that executes on a multiprocessor must be a parallel program. A parallel
program is a set of concurrently executing sequential processes that cooperate to solve
a common problem. A particularly popular class of multiprocessors is general purpose
multiple-instruction multiple-data stream (MIMD) architectures with shared memory.
Notable examples of this class of architecture include the Sequent, Alliant, Ultracomputer
and Cedar multiprocessors [Seq89,PM86,KLC+86,Got88].

Shared memory is used in parallel programs to achieve coordination and synchronization.
For example, writing to a shared memory location broadcasts information to other
parallel threads, and signaling through a shared memory location synchronizes the
execution of parallel threads. When shared memory is used for these purposes, concurrent
access to the same memory location is expected. A completely different use of shared
memory is for storing global shared data structures, such as large arrays in numeric code
and tables and queues in systems software. The memory locations of these shared data
objects are generally not intended to be accessed concurrently. For example, shared
arrays in scientific parallel programs are often partitioned into sub-regions so that different
threads can perform computations for each sub-region concurrently and asynchronously.
However, when concurrent threads access the same element in the array and at least one
access is a write operation, a coordination mechanism must be used to ensure correct
access.

Erroneous behavior in shared memory parallel programs is often due to access anomalies.
An access anomaly occurs when two concurrent execution threads access the same
memory location in an "unsafe" manner: more specifically, when either two concurrent
threads both write, or one reads and one writes a shared memory location without
coordinating these accesses. This thesis examines the problem of access anomaly detection.


The program segment in Figure 1.1 illustrates the concept of an access anomaly. The
doall construct creates two parallel threads that execute concurrently. Variable Y is
accessed by both concurrent threads. Because Y is only read concurrently, these accesses
are safe. Similarly, the assignment to Y in the first statement does not cause an anomaly
since this write is always performed before either read. However, because both concurrent
iterates write the variable X, these accesses to X are anomalous. The final value assigned
to X is either 2, 3 or 4 depending on the relative execution speed of the concurrent threads.

    Y := 1
    doall i := 1 to 2
        X := Y + i
    endall
    Z := X + Y

[Diagram: Y:=1 precedes the two concurrent statements X:=Y+1 and X:=Y+2, which both precede Z:=X+Y]

Figure 1.1: Simple Program With an Access Anomaly

An access anomaly is a specific instance of a general type of behavior in parallel
programs known as a race condition. A race condition exists when the execution of two
operations that are not independent can occur in either order during the execution of a
program. While certain race conditions do not affect the outcome of programs and are
safe (e.g. the nondeterministic order in which a lock is granted), access anomalies often
introduce nondeterminism that modifies execution results.

It is essential that access anomalies be detected and reported to the user or support
environment. Perhaps the most important reason is that access anomalies are usually
bugs. In fact, an Ada program that contains an access anomaly is defined to be
erroneous. Although some programs are designed to contain this type of behavior, the
nondeterminism stemming from access anomalies is usually unintentional. Access anomalies
typically result from incorrect interthread coordination, as in missing lock operations,
or program logic errors, as in incorrect array referencing.

Second, it is often useful to be able to repeat an execution of a parallel program for
debugging purposes. It is not possible to reproduce a given execution instance E unless
all race conditions in E are detected and forced to occur in the same order in subsequent
re-executions. If access anomalies are not detected, trace-and-replay debuggers are of
little use in debugging shared memory parallel programs.

Lastly, concurrent threads in a shared memory parallel program can communicate
through either shared memory or coordination primitives. By detecting access anomalies,
all interaction among concurrent threads is made explicit. Therefore, techniques that
have been developed for analyzing and evaluating distributed memory programs can be
applied to shared memory programs.

Unfortunately, traditional debugging techniques are of limited use in finding access
anomalies, since such program behaviors are sensitive to execution timing. Several
alternative detection methods have been proposed. One approach uses static analysis to isolate
a set of potential access anomalies based on a static representation of the program. The
primary benefit of static detection is that it provides general information about the
program, rather than information specific to a given input vector. However, because static
analysis is conservative, the number of potential anomalies reported may be very large.
Inaccuracies stem from variable aliasing, pointers, and difficulties in determining if two
portions of code can ever execute concurrently.

A second approach uses dynamic analysis to detect access anomalies in a single
execution instance of a program. Dynamic analysis complements static analysis by locating
anomalies precisely for that execution instance. In trace-based detection, all accesses to
shared variables and all parallel operations are traced during program execution. This
information is analyzed after the program completes. In on-the-fly detection of access
anomalies, anomalies are detected as the program executes through the use of monitoring
techniques. An access anomaly is discovered during program execution in much the
same way as array subscript out-of-range exceptions are currently reported. As soon as
an anomaly is detected, an exception is raised specifying the variable involved as well as
the instruction counters. The primary advantages of on-the-fly detection over trace-based
dynamic methods derive from data compression.

This thesis is organized as follows. Chapter 2 defines a partial order execution graph
that represents the "happens-before" partial order relationship for an execution instance
of a parallel program and describes how common thread creation, termination and
coordination primitives are modeled in the partial order execution graph. Chapter 2 also gives
a framework for determining when a single execution instance is sufficient for a dynamic
anomaly detection algorithm to guarantee that a given program and input vector pair will
never have an access anomaly.

Chapter 3 presents the task recycling technique, a new on-the-fly algorithm for detecting
anomalies. Task recycling improves upon existing dynamic access anomaly detection
techniques in three significant ways:


1. Generality. The task recycling technique is applicable to a wide variety of parallel
constructs and all common synchronous and asynchronous coordination primitives.

2. Performance. Because variable accesses are generally the most frequent operation,
the task recycling technique reduces the overhead per variable access to a small
constant at the expense of a higher cost incurred at parallel operations.

3. Storage. The amount of information stored by the task recycling technique depends
only on the maximum concurrency of the program, rather than the length of the
program execution.

The approaches used in task recycling can be integrated into other parallel program
debugging tools to improve their functionality.

Chapter 4 presents empirical measurements of on-the-fly detection as a general access
anomaly detection technique and task recycling as a specific algorithm. Empirical data
indicate that the task recycling algorithm performs better than existing on-the-fly
techniques for a wide class of parallel programs. Moreover, the overhead incurred by task
recycling is small enough to make it a viable tool for debugging parallel programs.

One of the primary drawbacks of dynamic detection is that it can find only those
anomalies that occur in a single execution instance. In order to guarantee that there are
no anomalies in a given program and input vector pair, one may have to examine many
different execution instances. Chapter 5 presents a new representation for critical section
coordination that decreases the number of execution instances that must be considered.
Static analysis techniques are used to determine when this representation is guaranteed
to accurately model the program behavior.


1.1      Existing  Work 


The notion of access anomalies was first defined in terms of Read and Write sets by
Bernstein [Ber66]. Each sequence of instructions in a parallel program has a Read and
a Write set associated with it: the Read set consists of those variables that are read
in the instruction sequence and the Write set, those variables that are written. If two
potentially concurrent instruction sequences Si and Sj meet the following three conditions,
all execution orderings of Si and Sj are guaranteed to produce the same result (i.e. Si
and Sj do not introduce any external nondeterminism):

Bernstein's  Conditions: 

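$$\mathit{Read}(S_i) \cap \mathit{Write}(S_j) = \emptyset$$
$$\mathit{Write}(S_i) \cap \mathit{Read}(S_j) = \emptyset$$
$$\mathit{Write}(S_i) \cap \mathit{Write}(S_j) = \emptyset$$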


If  these  conditions  are  not  met,  concurrent  execution  of  Si  and  Sj  may  lead  to  access 
anomalies  that  alter  the  result  of  the  execution. 
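
As an illustration only, Bernstein's conditions can be checked with three set intersections. The sketch below assumes the Read and Write sets are already available as Python sets of variable names; the function names and the dictionary keys are hypothetical and do not come from any of the systems surveyed in this chapter.

```python
def bernstein_violations(read_i, write_i, read_j, write_j):
    """Return, per condition, the variables on which Si and Sj conflict."""
    read_i, write_i = set(read_i), set(write_i)
    read_j, write_j = set(read_j), set(write_j)
    return {
        "read_i/write_j": read_i & write_j,
        "write_i/read_j": write_i & read_j,
        "write_i/write_j": write_i & write_j,
    }


def satisfies_bernstein(read_i, write_i, read_j, write_j):
    """True when all three intersections are empty."""
    violations = bernstein_violations(read_i, write_i, read_j, write_j)
    return not any(violations.values())


# The two doall iterates of Figure 1.1: each one reads Y and writes X.
conflicts = bernstein_violations({"Y"}, {"X"}, {"Y"}, {"X"})
print(conflicts["write_i/write_j"])
```

For the two iterates of Figure 1.1 only the third condition fails, on X; the shared reads of Y violate none of the conditions.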

In  order  to  evaluate  Bernstein's  conditions,  two  types  of  information  must  be  available: 

1.  Which  instruction  sequences  potentially  execute  concurrently,  and 

2. Which variables are accessed by each instruction sequence.

One of the primary goals of an anomaly detection algorithm is to store and analyze this
information as efficiently as possible.

1.1.1      Static  Anomaly  Detection 

In static detection of access anomalies, a graph which represents the static structure of
the program is analyzed and a set of potential access anomalies is reported. Traditional
sequential control flow analysis techniques are used to determine the set of variables read
and written in each node in the static program graph. Therefore, the primary issue in
static detection of access anomalies is that of determining if two nodes can potentially
execute concurrently.

One of the first static analysis algorithms for detecting access anomalies is due to
Taylor [TO80,Tay83]. The primary emphasis in this work was the analysis of Ada programs;
however, the techniques developed can be used to analyze any language with a
rendezvous-based coordination mechanism. An annotated flow graph is created for every Ada task.
Because Taylor et al. are interested in intertask interaction, each node contains a
summary of sequential operations performed by the code associated with that node. This
technique of sequential abstraction is common to virtually all of the static representation
techniques.

The analysis of annotated flow graphs is similar to symbolic execution. A concurrency
state graph is generated by simulating all possible coordination sequences. Each node in
the graph is a concurrency state, representing a set of tasks that can execute concurrently.
Directed edges in the graph indicate which concurrency states can follow each other in
a valid execution. Thus, the path from the root to a given state in the graph
represents a valid concurrency history and describes one possible sequence of synchronization
events which can occur. Access anomalies are then detected by analyzing the concurrency
histories.

The strength of this approach is its accuracy. Instead of attempting to be conservative
and summarizing information to worst case scenarios, all possible scenarios are considered
and thus more exact answers are possible. This accuracy is essential for some issues in
parallel program analysis (for example, the formal verification of Ada tasking programs
[DKH88]). The primary weakness of this approach is that the number of possible
concurrency states is exponential in the size of the program. In addition, it is very difficult
to model programs which dynamically create tasks, either with recursive procedures or
pointers.

Appelbe and McDowell extend Taylor's work to parallel Fortran programs and use
static analysis to generate input to a dynamic monitoring system [AM85,AM88,McD88].
Their work in static analysis of parallel programs is just one component of a system for
generating multitasking application programs. Static analysis is performed by creating a
standard control flow graph and then compressing it into a synchronization graph which
contains only relevant synchronization operations. This graph is then used to create a
concurrency history graph. They use the concurrency history graph to detect
"synchronization anomalies" (states that lead to program deadlock) as well as access anomalies.

The third major body of work in this area is by Emrath and Padua. In [EP88], they
present some preliminary ideas on the detection of nondeterminism in parallel programs.
They define an ordering graph whose nodes represent statements, and whose arcs
represent execution order between statement instances. Although they do not describe how
ordering graphs are created or analyzed, they present a clean mechanism for representing
coordination among parallel loop iterates.

Each coordination edge has an associated distance vector which is annotated with the
distance between the two coordinating threads. This allows for a compact representation
of the coordination among parallel loop iterates. Distance vectors are similar to the
direction vectors which are used to represent data dependences in analysis of sequential
programs. However, rather than simply containing <, > or = information, an entry in a
distance vector contains the distance (measured in number of loop instances) of the two
coordinating parallel loop instances.

PTOOL is an environment designed at Rice for developing and debugging both
automatically parallelized and explicitly parallel Fortran programs [BKK+88,BKK+89]. One
aspect of the system is a tool for detecting and inspecting potential access anomalies.
Callahan and Subhlok [CS88,CKS90] propose a synchronized flow graph, based in part on
the ideas proposed by Emrath and Padua, to detect potential anomalies. They focus on
event synchronization and use instance and distance vectors to analyze coordination among
loop iterates. In contrast to the preceding systems, a more general data flow approach is
used which results in less exact, but computationally tractable, solutions.

A second effort in conjunction with the PTOOL environment is by Balasundaram and
Kennedy [BK89]. They use a co-graph similar to the ordering graph of Emrath and Padua.
Their primary emphasis is on detection of various types of deadlock conditions, although
their analysis techniques can be used to detect access anomalies.


1.1.2  Trace-Based  Anomaly  Detection 

Trace-based anomaly detection is a dynamic technique that finds the anomalies that
occur in a given execution instance. In trace-based anomaly detection, all accesses to
shared variables and all parallel operations are traced during program execution. After
the program completes, a graph that represents the concurrency structure of the program
execution is built, the concurrency relationship is computed, and a set of access anomalies
is reported.
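
The post-mortem step can be pictured with a deliberately naive sketch. It assumes the trace is a list of (block, variable, kind) records and that the concurrency relationship is supplied as a predicate `ordered(a, b)`; both the record layout and the names are assumptions made for illustration, and the quadratic pairwise scan is not the method of any system cited in this section.

```python
from itertools import combinations


def find_anomalies(trace, ordered):
    """Report (variable, block, block) triples for unordered conflicting accesses.

    trace:   list of (block, variable, kind) records, kind is 'r' or 'w'
    ordered: ordered(a, b) is True when block a happens before block b
    """
    anomalies = []
    for (b1, v1, k1), (b2, v2, k2) in combinations(trace, 2):
        if v1 != v2 or b1 == b2:
            continue  # different variables, or two accesses in the same block
        if k1 == "r" and k2 == "r":
            continue  # concurrent reads are safe
        if not ordered(b1, b2) and not ordered(b2, b1):
            anomalies.append((v1, b1, b2))
    return anomalies


# Hypothetical trace of the program in Figure 1.1.
trace = [
    ("init", "Y", "w"),
    ("iter1", "Y", "r"), ("iter1", "X", "w"),
    ("iter2", "Y", "r"), ("iter2", "X", "w"),
    ("final", "X", "r"), ("final", "Y", "r"),
]
before = {
    ("init", "iter1"), ("init", "iter2"), ("init", "final"),
    ("iter1", "final"), ("iter2", "final"),
}
report = find_anomalies(trace, lambda a, b: (a, b) in before)
print(report)
```

On this trace the only pair reported is the two writes to X by the concurrent iterates.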

Choi, Netzer and Miller proposed a trace-based technique for dynamic access anomaly
detection [MC88,CMN88,Cho89] as part of a large system for debugging parallel programs.
A before and an after vector are associated with each instruction sequence S; these vectors
contain information about which instruction sequences respectively preceded S and followed S
during the execution of the program. Access anomalies are detected by comparing the
accesses made by each instruction sequence S with the accesses by all instruction sequences
that are concurrent with S. Netzer and Miller [NM89] developed methods for determining
whether or not an access anomaly is a side-effect of anomalies that preceded it in the
execution instance. When this is the case, additional ordering is added to the graph to
eliminate this "false" anomaly.

Emrath, Ghosh and Padua [EGP89,EP88] have also developed a trace-based technique
for detecting access anomalies. In contrast with the system of Choi and Miller, they
attempt to model the ordering that is guaranteed to occur in similar execution instances,
rather than what actually appeared in the given execution instance. The concurrency
relationship is computed iteratively by finding the closest common ancestor of all
candidate coordination partners. This relationship may be incorrect in that two instruction
sequences that can never execute concurrently may be misidentified as being concurrent.

1.1.3  On-The-Fly  Anomaly  Detection 

Similar to trace-based detection, on-the-fly anomaly detection finds the anomalies that
occur within a given execution instance. However, in on-the-fly detection the program
execution is monitored, and anomalies are detected as they occur. The primary advantages
of on-the-fly detection over trace-based dynamic methods derive from data compression.
Variable access and parallel operation information that is no longer needed is discarded
as the program executes. Moreover, because anomalies are detected during execution,
breakpoints or exception handling routines can be invoked when an anomaly is detected.
However, fewer anomalies are reported here than in trace-based analyses, and
sophisticated concurrency relationships among executing threads cannot be modeled.

The Merge algorithm [Sch89,Sni88] is derived from the following observation: if more
than one concurrent instruction sequence reads the same variable, these read events may
be collapsed into a single event after all concurrent instruction sequences complete. To
implement the Merge algorithm a concurrency list is associated with one or more
instruction sequences, and Read and Write sets are associated with concurrency lists. The
concurrency list associated with an instruction sequence S is the set of tasks T such that
the instruction sequence currently executing in T is concurrent with S. When an
instruction sequence in task T terminates, its Read and Write sets are compared with all Read
and Write sets associated with concurrency lists Li that contain T, and T is deleted from
each Li. Read and Write sets are merged whenever their associated concurrency lists are
equal. In the worst case, the number of concurrency lists, and therefore the number of
Read and Write sets, is quadratic in the parallelism of the program.

Nudler and Rudolph [NR88a,NR88b] proposed a scheme known as English-Hebrew
labeling. Read and write set information is stored in an inverse format by associating
access information with every shared variable X; this information is checked whenever X
is read or written. Concurrency information is maintained by associating a tag with each
instruction sequence that consists of a pair of labels: an English label E and a Hebrew label H. Two
instruction sequences Si and Sj with tags ti and tj are potentially concurrent if and only
if the following condition is met:

$$(E(t_i) < E(t_j) \text{ and } H(t_i) > H(t_j)) \text{ or } (E(t_i) > E(t_j) \text{ and } H(t_i) < H(t_j))$$

English-Hebrew tags only partially encode the concurrency relationship. Therefore,
each executing instruction sequence S has a coordination list that contains the tags of
instruction sequences that occur before S but fail the above test. The complete test for
concurrency requires determining if the tag of Si or any of the tags in the coordination
list of Si are ordered with the tag of Sj.
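
The basic tag comparison can be sketched in a few lines. This is a simplified illustration: labels are modeled as plain integers, which is an assumption made here rather than the labeling used in [NR88a,NR88b], the names `Tag` and `tags_unordered` are hypothetical, and the coordination-list part of the complete test is not modeled.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    english: int  # English label E (hypothetical integer encoding)
    hebrew: int   # Hebrew label H (hypothetical integer encoding)


def tags_unordered(ti: Tag, tj: Tag) -> bool:
    """Basic tag test: the two labels disagree about the order of ti and tj."""
    return (ti.english < tj.english and ti.hebrew > tj.hebrew) or (
        ti.english > tj.english and ti.hebrew < tj.hebrew
    )


a = Tag(english=1, hebrew=2)
b = Tag(english=2, hebrew=1)
c = Tag(english=3, hebrew=3)
print(tags_unordered(a, b), tags_unordered(a, c))
```

Here a and b are reported as potentially concurrent because the English and Hebrew labels order them in opposite directions, while a and c are ordered because both labels agree.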

Hood, Kennedy and Mellor-Crummey are developing an on-the-fly detection algorithm
for a subset of the Parallel Fortran language supported by PTOOL [HKMC89]. In
particular, they support nested parallel constructs in which the only valid form of coordination
is doacross coordination of distance 1. The goal of their technique is to maintain
concurrency information efficiently by taking advantage of the regular concurrency relationship
of programs in this class.

Concurrency information is maintained by associating a tag with each instruction
sequence. The tag of an instruction sequence in iterate i of an unnested normalized doall
loop is computed as follows:

$$\mathit{tag} = B + i + kN$$

where k is the number of doacross operations that have been performed, N is the number
of iterates in the doall, and B is the tag of the block that performed the doall operation.
An instruction sequence in iterate i is a descendant of a block with tag tag if and only if
the following condition holds:

$$((\mathit{tag} - B - 1) \ \mathrm{div}\ N < (1 + k)) \text{ and } (((\mathit{tag} - B - 1) \bmod N) < (1 + i))$$

For nested parallel constructs, a stack of tags is maintained and checked at every access.
The maximum number of instruction sequences in an iterate is assumed to be known a
priori or conservatively estimated using static analysis.

Steele [Ste90] proposes a technique for detecting anomalies in programs that do not
coordinate. The structure of the concurrency relationship at any point in time is a tree;
this regularity is used to determine whether or not two instruction sequences are
concurrent. When an instruction sequence S accesses a variable V, it checks that all instruction
sequences at lower levels in the concurrency tree T that also accessed V are on the path
from S to the root of T. When an instruction sequence terminates, it checks all instruction
sequences of higher nesting levels in T.




Chapter  2 

Parallel  Program  Representation 


A parallel program consists of a set of parallel threads that cooperate to solve a problem.
During any given execution E of a parallel program, certain instruction sequences are
constrained to execute consecutively due to sequential control flow constraints and the
semantics of the thread creation, termination and coordination operations performed.
Other instruction sequences, however, can execute at the same time. If one instruction
sequence Si is guaranteed to have executed before another instruction sequence Sj because
of the parallel operations performed in an execution instance E, then Si and Sj are ordered
by the "happens-before" partial order relation as defined by Lamport [Lam78,Pra86].

The "happens-before" relationship of an execution instance E is used to detect access
anomalies in E. Specifically, two accesses to the same shared variable are anomalous only
if they are performed by two instruction sequences that are not ordered by this relation.
Section 2.1 defines a partial order execution graph (POEG) that represents the
"happens-before" relationship for an execution instance of a parallel program. Section 2.2 describes
how the POEG is used to determine concurrency relationships among the instruction
sequences.

The partial order relationship represented by a POEG is based on the semantics of
the parallel operations in the program; it does not reflect sequential execution that occurs
as a side effect of scheduling decisions or due to the parallelism of the underlying machine.
However, it is crucial that all of the instruction sequence interaction in a given execution
instance is correctly modeled in the associated POEG. If the relationship modeled by the
POEG is incorrect, the accuracy of anomaly detection is affected: false anomalies can
be reported and/or access anomalies can be hidden in many or all execution instances.
Section 2.3 discusses how common coordination primitives are accurately represented in
a POEG.

In general, many different execution instances of the same program and input vector
pair are represented by the same POEG. However, two execution instances of a given
program and input vector pair cannot always be mapped to the same POEG. When this
is the case, anomalies may go undetected in many execution instances. This makes it
difficult to say that a given program and input vector pair does not have any access
anomalies. Section 2.4 discusses the notion of reliability in dynamic anomaly detection
and presents results on the complexity of guaranteeing that a program will never have an
access anomaly when executed on a given input vector.

2.1      Structure  of  the  Partial  Order  Execution  Graph 

A partial order execution graph represents all interaction among instruction sequences
through parallel operations. Vertices in a POEG are associated with parallel operations:
namely, fork, join and coordination vertices. Two types of directed edges represent the
sequential and parallel control flow constraints. A block edge of a POEG represents an
instruction sequence and is delimited by parallel operations; a coordination edge (denoted
by a dashed line) connects two coordination vertices. Thus, a partial order execution
graph is represented by a triple: vertices, block edges and coordination edges.

In parallel programs, parallelism is achieved through a thread creation or fork operation
that creates several threads that execute concurrently. Similarly, a thread termination
or join operation indicates a decrease in parallelism. Fork and join operations are
represented in the POEG by fork and join vertices:

•  A fork vertex has one in-edge (the block that performed the fork operation) and f
out-edges (the f blocks created by the fork operation).

•  A  join  vertex  has  one  out-edge  (the  block  that  follows  the  join  operation)  and  j 
in-edges  (the  j  blocks  terminated  by  the  join  operation). 

The basic control flow for a parallel program is defined by the fork and join operations
as well as the sequential control flow within a given block edge. The execution of a block
that performs a fork operation must precede the execution of all of the threads created
in the operation. The block that executes after a join operation can begin only after all
blocks terminating in the join operation are complete.
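
The triple described above (vertices, block edges, coordination edges) can be sketched as a small data structure. This is only an illustration and not the representation used in the thesis; the class name `POEG`, its methods and the vertex names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class POEG:
    """Hypothetical POEG held as a triple: vertices, block edges, coordination edges."""
    vertices: set = field(default_factory=set)
    block_edges: dict = field(default_factory=dict)  # block name -> (tail vertex, head vertex)
    coord_edges: set = field(default_factory=set)    # (sender vertex, receiver vertex)

    def add_block(self, name, tail, head):
        # A block edge is delimited by the two parallel operations at its ends.
        self.vertices.update((tail, head))
        self.block_edges[name] = (tail, head)

    def out_blocks(self, vertex):
        return sorted(b for b, (t, _) in self.block_edges.items() if t == vertex)

    def in_blocks(self, vertex):
        return sorted(b for b, (_, h) in self.block_edges.items() if h == vertex)


# One doall with three iterates: b0 performs the fork, b1..b3 are the
# created blocks, and b4 follows the join.
g = POEG()
g.add_block("b0", "start", "f1")
for k in (1, 2, 3):
    g.add_block(f"b{k}", "f1", "j1")
g.add_block("b4", "j1", "end")
```

In this sketch the fork vertex `f1` has one in-edge and three out-edges, and the join vertex `j1` has three in-edges and one out-edge, matching the two rules listed above.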

A parallel construct is a closed, nestable mechanism for creating parallel threads and
consists of a fork and a join operation. The semantics of a parallel construct requires
that exactly those threads created by the fork operation terminate at the associated join
operation. Examples of parallel constructs include doall and parallel case constructs. (A
doall construct creates a set of n homogeneous thread iterates; a parallel case construct
creates heterogeneous threads.) The POEG for a program that only contains parallel
constructs is a series-parallel graph. A series-parallel POEG allows for certain optimization
in representing and analyzing the graph during program execution. Unless otherwise
stated, we assume that threads are always created using parallel constructs.

Figure 2.1 illustrates how a simple parallel program is represented by a POEG. Every
doall operation has corresponding fork and join vertices. Each increase in parallelism is
represented by a fork vertex (f1, f2, f3, f4 and f5) and each decrease in parallelism is
represented by a join vertex (j1, j2, j3, j4 and j5). Because a doall is a parallel construct,
all threads created by a fork vertex terminate at the associated join vertex (for example,
f1 and j1).

Figure 2.1: POEG for Nested Doall Program

Programmers use synchronization and coordination to control access to shared data
or pass status information. Therefore, a POEG must also reflect the implicit and explicit
orderings that result from this interaction. As a practical matter, it is not always possible
to recognize when shared memory is used for coordination. For example, a thread can
use a shared variable to broadcast information to other threads. If this is not identified as
a coordination operation, the concurrent accesses to shared memory will be improperly
identified as access anomalies. More importantly, the resultant ordering will not be added
to the POEG. Only coordination that is explicitly specified by the programmer through
the use of identifiable routines is represented in a POEG; the complexity of correctly
detecting implicit coordination as well as deducing the intended ordering is beyond the
scope of this thesis.

One type of ordering that is represented in the POEG is the explicit control flow ordering
that results from coordination. For example, the semantics of the barrier coordination
primitive require that a block that executes after a barrier operation in one thread must
follow the execution of all blocks that coordinate in the barrier.

The nondeterministic execution order of certain coordination operations can result in
implicit ordering among blocks. For instance, the semantics of critical section coordination
does not specify an explicit ordering among threads that execute the critical sections.
However, the execution order of the critical sections can affect subsequent execution.
In addition, certain concurrent accesses to shared variables may be intentional; these
accesses are treated as a special type of shared memory coordination. The implicit order
that stems from coordination must be represented in the POEG whenever the result of
program execution depends on the order in which the blocks coordinate.

One could imagine coordination primitives that enforce arbitrary ordering constraints
among mutually concurrent blocks. In general, however, the goal of coordination is to
synchronize the execution of several blocks at some coordination point. Therefore, we
restrict the notion of coordination within an execution instance as follows:

Coordination Constraint: A coordination point is a stage P during the
execution of a program such that one set of blocks R can continue execution
only if a second set of blocks S has reached P.

Thus, if a block is waiting for a set of blocks S to reach some point in their execution,
any other block that is waiting for any block in S must wait for all blocks in S. This
coordination constraint bounds the complexity of algorithms presented in Chapter 3 while
remaining general enough to represent all common forms of coordination. (The representations
presented in Section 2.3 meet this constraint.) For the remainder of this thesis,
all coordination is assumed to meet this constraint.

The implicit and explicit ordering that arises from a coordination point is represented
in the POEG by adding coordination edges that connect coordination vertices, thereby
ordering the subsequent execution. A coordination vertex can be one of two types:

•  A sender vertex has n coordination out-edges to associated receiver vertices.

•  A  receiver  vertex  has  n  coordination  in-edges  from  associated  sender  vertices. 

In addition, every coordination vertex has an in-edge which is the block preceding the
coordination operation and an out-edge which is the block following the coordination
operation. Synchronous coordination is represented by a set of sender and receiver vertices
in each coordinating thread (for an example see Section 2.3.1) and is generally shown in
POEG diagrams as a bidirectional coordination edge.

To illustrate the use of coordination edges, consider the program fragment in Figure 2.2.

    doall i = 1 to 3
        if i ≠ 1 then wait(E[i])
        if i ≠ 3 then signal(E[i + 1])
    end all

Figure 2.2: POEG with Signal-Wait Coordination

The semantics of signal-wait coordination requires that the execution of a signal
operation precede the execution of the associated wait operation. This ordering is modeled
in the POEG by a coordination edge from a sender vertex that represents the signal
operation, to a receiver vertex that represents the associated wait operation. The coordination
edge denotes that all of the execution that happened before the signal operation is
guaranteed to have completed before all of the computation that followed the wait operation.
Thus, the execution of block b1 must precede that of block b4 and the execution
of block b4 must precede that of block b7. Because the ordering relationship is transitive,
the execution of block b1 must also precede block b7.
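
The transitive ordering induced by the signal-wait chain can be checked with a simple reachability test. The sketch below is an assumption-laden illustration: the block names (`pre1`, `mid2`, `post3`, and so on) are hypothetical and do not correspond to the block numbering of Figure 2.2, and the graph is held at the level of blocks rather than vertices.

```python
def reachable(edges, src, dst):
    """True if there is a non-empty path from src to dst."""
    stack, seen = list(edges.get(src, ())), set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False


# Iterate 1 signals E[2]; iterate 2 waits on E[2], then signals E[3];
# iterate 3 waits on E[3]. Each wait or signal splits an iterate into blocks.
edges = {
    "pre1": ["post1", "mid2"],   # block edge, plus coordination edge signal(E[2]) -> wait(E[2])
    "pre2": ["mid2"],            # block edge into the block that follows the wait
    "mid2": ["post2", "post3"],  # block edge, plus coordination edge signal(E[3]) -> wait(E[3])
    "pre3": ["post3"],
}

ordered_by_transitivity = reachable(edges, "pre1", "post3")
```

Here `pre1` precedes `mid2` and `mid2` precedes `post3`, so `pre1` also precedes `post3`, while `pre3` and `post1` remain unordered.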

2.2      Detecting  Concurrency  Using  the  POEG 

The "happens-before" partial order relation is globally captured by the interblock
relationships defined below:

Definitions: 

A block a is an ancestor of a block b (and block b is a descendant of block a) if and
only if there is a path from a to b in the POEG.

A block a is a direct ancestor of a block b if and only if there is a path from a to b
in the POEG that does not include a coordination edge.

A block a is an indirect ancestor of a block b if and only if all paths from a to b in
the POEG include a coordination edge.

The direct parents (resp. indirect parents) of a block b is the set of closest direct
ancestors (resp. indirect ancestors) of b.

Two blocks are concurrent if and only if neither is an ancestor of the other.

The maximum concurrency of a POEG is the maximum number of mutually concurrent
blocks.

A  block  a  can  be  an  ancestor  of  a  block  b  either  from  sequential  control  flow  constraints 
(e.g.  because  of  block  edges)  or  due  to  the  ordering  imposed  by  coordination  edges. 
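
The definitions above translate directly into path queries over a graph whose edges are tagged by kind. The following is a minimal sketch under stated assumptions: blocks are graph nodes, each edge is labelled `"block"` or `"coord"`, an indirect ancestor is taken to be an ancestor with no coordination-free path, and all function and block names are hypothetical.

```python
BLOCK, COORD = "block", "coord"


def has_path(edges, src, dst, kinds):
    """True if a non-empty path src -> dst exists using only edge kinds in `kinds`."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        for nxt, kind in edges.get(node, ()):
            if kind not in kinds:
                continue
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def is_ancestor(edges, a, b):
    return has_path(edges, a, b, {BLOCK, COORD})


def is_direct_ancestor(edges, a, b):
    # Some path from a to b avoids every coordination edge.
    return has_path(edges, a, b, {BLOCK})


def is_indirect_ancestor(edges, a, b):
    # a is an ancestor of b, but every path uses a coordination edge.
    return is_ancestor(edges, a, b) and not is_direct_ancestor(edges, a, b)


def concurrent(edges, a, b):
    return a != b and not is_ancestor(edges, a, b) and not is_ancestor(edges, b, a)


# A sender thread (s1 then s2) and a receiver thread (r1 then r2) joined by
# one asynchronous coordination edge from the end of s1 to the start of r2.
edges = {
    "s1": [("s2", BLOCK), ("r2", COORD)],
    "r1": [("r2", BLOCK)],
}
```

With this graph, `s1` is a direct ancestor of `s2` and an indirect ancestor of `r2`, while `s1` and `r1` are concurrent.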

To illustrate these definitions, consider the POEG in Figure 2.3. This POEG represents
a program segment containing two nested parallel constructs and an asynchronous
coordination operation between blocks b2 and b6. Block b2 is the direct parent of block b8;
block b6 is the indirect parent of b8 because of the coordination edge between blocks b2 and
b6. Block b3 is concurrent with blocks b4, b5 and b2; it is not concurrent with block b0 or
b1 (which are ancestors), with blocks b6, b7 or b13 (which are direct descendants), or with
blocks b8, b9, b10, b11 or b12 (which are indirect descendants). The maximum concurrency
of the POEG in Figure 2.3 is four, since blocks b3, b4, b5 and b2 are mutually concurrent
(as are blocks b7, b9, b10 and b11).

Figure 2.3: POEG with Coordination

Theorem 1 proves that this definition of concurrent execution agrees with the common
concept of concurrency:

Theorem 1: Two blocks $b_i$ and $b_j$ in a POEG can execute at the same time
if and only if $b_i$ is neither an ancestor nor a descendant of $b_j$.

The advantage of a POEG representation is that concurrency determination is independent
of the number and relative speed of processors executing the program. This is because
the partial order relationship expressed by the POEG is based on the semantics of a program
rather than the actual execution time of instruction sequences. By more accurately
reflecting the causal ordering of the program, the significance of anomaly detection results
is increased, since these results are independent of the underlying architecture. Moreover,
this allows access anomalies in parallel programs to be detected on a uniprocessor.

Because coordination introduces additional ordering, the maximum concurrency of a
POEG with coordination edges may be less than the maximum concurrency of the same
graph with coordination edges removed. For instance, the maximum concurrency of the
POEG in Figure 2.3 is increased from four to six if the coordination edge between b2
and b6 is removed. Intuitively, since adding coordination edges reduces concurrency, one
expects anomaly detection to be more efficient for programs that coordinate. However,
the partial order relationships in a POEG with coordination edges are more difficult to
encode efficiently. This added complexity is reflected in increased computational and space
requirements for managing concurrency information in anomaly detection algorithms.
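
The change in maximum concurrency from four to six can be reproduced by brute force on a small graph. The edge list below is an assumption: it is one block-level graph that is consistent with the prose description of Figure 2.3 (an outer construct with two threads, each containing a three-way nested construct, and a coordination edge placing b6 before b8), not a transcription of the figure itself. The helper names are hypothetical, and the exhaustive search is only practical for graphs of this size.

```python
from itertools import combinations


def descendants(edges, src):
    """All blocks reachable from src by a non-empty path."""
    seen, stack = set(), [src]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def max_concurrency(edges):
    """Size of the largest set of mutually concurrent blocks (exhaustive search)."""
    nodes = set(edges) | {n for outs in edges.values() for n in outs}
    desc = {n: descendants(edges, n) for n in nodes}

    def mutually_concurrent(group):
        return all(b not in desc[a] and a not in desc[b]
                   for a, b in combinations(group, 2))

    for size in range(len(nodes), 0, -1):
        if any(mutually_concurrent(g) for g in combinations(sorted(nodes), size)):
            return size
    return 0


# Assumed block-level edges (sequential, fork and join ordering only).
block_edges = {
    "b0": ["b1", "b2"],
    "b1": ["b3", "b4", "b5"],
    "b3": ["b6"], "b4": ["b6"], "b5": ["b6"],
    "b6": ["b7"],
    "b2": ["b8"],
    "b8": ["b9", "b10", "b11"],
    "b9": ["b12"], "b10": ["b12"], "b11": ["b12"],
    "b7": ["b13"], "b12": ["b13"],
}

# Same graph plus the assumed coordination ordering b6 -> b8.
with_coord = {k: list(v) for k, v in block_edges.items()}
with_coord["b6"].append("b8")

concurrency_with = max_concurrency(with_coord)
concurrency_without = max_concurrency(block_edges)
```

Under these assumptions the search gives four with the coordination edge and six without it, in agreement with the text.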

Asynchronous coordination introduces two additional problems. First, a POEG with
coordination could have a cycle of arbitrary length. However, Theorem 2 shows that
anomaly detection in cyclic graphs is hidden by the larger problem of deadlock.

Theorem 2: A set of edges C form a cycle in a POEG if and only if the
blocks in C are deadlocked.

Second, anomaly detection methods require that the concurrency information of all
coordinating blocks be available when the coordination occurs. Thus, whenever a sender
block  reaches  an  asynchronous  coordination  point  before  the  associated  receiver  block, 
information  about  the  sender  must  be  saved  until  the  receiver  reaches  the  coordination 
point.  This  leads  to  additional  cost  for  buffering  concurrency  information  for  asynchronous 
coordination. 

Theorem 1 Two blocks $b_i$ and $b_j$ in a POEG can execute at the same time if and only
if $b_i$ is neither an ancestor nor a descendant of $b_j$.

Proof. Let $create_b$ and $term_b$ denote the creation and termination times respectively of
block $b$ in a POEG and the relatives of $b$ denote all of the ancestors of $b$ and itself.
=> Suppose, for the sake of contradiction, that there is a block $b_i$ that can execute at the
same time as another block $b_j$, and $b_i$ is an ancestor of $b_j$. Because block $b_j$ is a descendant
of $b_i$, $b_j$ can only be created after $b_i$ has terminated. Therefore, $term_{b_i} < create_{b_j}$.
However, for two blocks to be able to execute at the same time, it must be possible for
both of their creation times to precede either termination time. Therefore, $b_i$ cannot
execute at the same time as $b_j$.

<= Suppose, for the sake of contradiction, that there are two blocks $b_i$ and $b_j$ such that $b_i$
is neither an ancestor nor descendant of $b_j$, and $b_i$ cannot execute at the same time as $b_j$.
Let $a_i$ and $a_j$ be the closest non-common relatives of $b_i$ and $b_j$ such that $a_i$ can execute
at the same time as $a_j$. There must be such a pair of blocks: they are two children of the
closest common ancestor of $b_i$ and $b_j$. Consider three cases for the relationships of $b_i$, $a_i$,
$b_j$ and $a_j$:

1. $b_i = a_i$ and $b_j = a_j$: It follows directly that $b_i$ can execute at the same time as $b_j$.

2. $b_i \neq a_i$ and $b_j \neq a_j$: Block $a_i$ cannot execute at the same time as any children of
$a_j$ (because they are all closer relatives of $b_j$ than $a_j$). Therefore $term_{a_i} \le term_{a_j}$.
Likewise, $term_{a_j} \le term_{a_i}$ and hence $term_{a_i} = term_{a_j}$. The only way to enforce
this equality is if $a_i$ and $a_j$ synchronize, which makes $a_i$ and $a_j$ relatives of both $b_i$
and $b_j$.

3. $b_i = a_i$ and $b_j \neq a_j$ (or vice versa): We know $term_{b_i} \le term_{a_j}$. To enforce this,
$b_i$ (or one of its descendants) must be a parent of $a_j$ (or one of its ancestors) and
hence $b_i$ is an ancestor of $b_j$.

In all cases we get a contradiction and therefore $b_i$ must be able to execute at the same
time as $b_j$. □

Theorem  2  A  set  of  edges  C  form  a  cycle  in  a  POEG  if  and  only  if  the  blocks  in  C  are 
deadlocked. 

Proof. 

=> Suppose, for the sake of contradiction, that there is a POEG with edges from $b_{i+1}$
to $b_i$ for $1 \le i < n$, an edge from $b_1$ to $b_n$ and there exists some order of execution in
which blocks $b_1 \ldots b_n$ all terminate. For $1 \le i < n$, $b_i$ cannot begin until $b_{i+1}$ terminates
and therefore must terminate strictly after $b_{i+1}$ terminates; thus, $term_{b_1} > \ldots > term_{b_n}$.
However, the edge from $b_1$ to $b_n$ requires $term_{b_n} > term_{b_1}$ which is a contradiction.
<= Suppose, for the sake of contradiction, that some blocks are deadlocked and there
is no cycle in the POEG. There must exist at least one block $b_i$ in the deadlock whose
ancestors $b_{i_1}, \ldots, b_{i_k}$ are not in the deadlock (the root block can always execute). At some
point blocks $b_{i_1}, \ldots, b_{i_k}$ will terminate (since they are not deadlocked) and $b_i$ can terminate.
Hence $b_i$ could not have been deadlocked. □
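
Theorem 2 suggests that a deadlock among coordinating blocks can be recognized as a cycle in the ordering graph. The sketch below uses a standard depth-first search for cycle detection; it is not an algorithm taken from the thesis, and the block names and the two-thread example are hypothetical.

```python
def find_cycle(edges):
    """Return a list of blocks that form a cycle, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is on the current DFS path, so the path closes a cycle.
                return path[path.index(nxt):]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(edges):
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Thread A waits for B's signal before signalling B, and B does the same
# with respect to A: block a_mid must precede b_mid and vice versa.
deadlocked = {
    "a_pre": ["a_mid"],
    "a_mid": ["a_post", "b_mid"],
    "b_pre": ["b_mid"],
    "b_mid": ["b_post", "a_mid"],
}
# Only A waits for B: no circular ordering.
acyclic = {
    "a_pre": ["a_mid"],
    "a_mid": ["a_post"],
    "b_pre": ["b_mid"],
    "b_mid": ["b_post", "a_mid"],
}
```

The recursive search is adequate for small illustrative graphs; a long chain of blocks would call for an iterative formulation.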

2.3      Representing  Coordination  Operations 

It is essential that the ordering that results from the execution of a parallel operation is
correctly represented in a POEG. If the ordering represented by the POEG is incorrect,
the accuracy of anomaly detection can be affected in two ways. First, if the order described
by a POEG is a subset of the partial order relationships intended by the programmer,
two accesses that cannot execute concurrently may be incorrectly identified as an access
anomaly. Second, if the ordering among instruction sequences described by a POEG is a
superset of the intended partial order relationship, access anomalies may go undetected.
There are many different coordination primitives, and a POEG representation must
be individually designed for each primitive [Din89], [AS83]. The primary difficulty in
representing a coordination primitive lies in correctly capturing its effect on program
execution. Some primitives have complex ordering semantics. Moreover, all of the relevant
information may not be available at the time that certain blocks reach a coordination
point. This section describes how several common coordination primitives (namely, barrier
coordination, critical sections, message passing, doacross coordination, Ada rendezvous
and events) are represented in the POEG. The techniques used here can be applied to
other coordination primitives.

2.3.1      Barrier  Coordination 

An n-way barrier coordination synchronizes the execution of n threads. Semantics of a
barrier are that each thread executing a barrier must wait until all threads have performed
the barrier, at which point all may continue their execution [Axe86]. It is incorrect for
either n - 1 or n + 1 concurrent threads to execute the same n-way barrier. Barriers are
commonly used in parallel programs that execute in phases. All threads execute a barrier
at the end of phase i, thereby guaranteeing that all processing of phase i is complete
before any processing of phase i + 1 is begun.

Barrier synchronization among threads $t_1 \ldots t_b$ is represented in the POEG by $2 \times (b-1)$
coordination edges. Each thread $t_i$ is modeled as performing a sequence of four steps:

1.  Asynchronously coordinating with thread $t_{i-1}$ as a receiver

2.  Asynchronously coordinating with thread $t_{i+1}$ as a sender

3.  Asynchronously coordinating with thread $t_{i+1}$ as a receiver

4.  Asynchronously coordinating with thread $t_{i-1}$ as a sender

Thread $t_1$ does not perform steps 1 and 4, and thread $t_b$ does not perform steps 2 and 3.
Similar techniques are used to represent any form of synchronous coordination.
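
The four-step scheme can be checked mechanically: it should yield 2 × (b − 1) coordination edges and order every pre-barrier block before every post-barrier block. The following sketch is an illustration under stated assumptions; the vertex encoding `(i, k)` for step k of thread i, the `("pre", i)` and `("post", i)` markers, and the function names are all hypothetical.

```python
def barrier_coordination_edges(b):
    """Coordination edges for a b-way barrier; vertex (i, k) is step k of thread i."""
    edges = []
    for i in range(1, b):
        edges.append(((i, 2), (i + 1, 1)))  # step 2 of t_i sends to step 1 of t_{i+1}
        edges.append(((i + 1, 4), (i, 3)))  # step 4 of t_{i+1} sends to step 3 of t_i
    return edges


def barrier_graph(b):
    """Vertex-level graph: ('pre', i) -> the steps of thread i -> ('post', i)."""
    graph = {}

    def link(u, v):
        graph.setdefault(u, []).append(v)

    for i in range(1, b + 1):
        # Thread 1 skips steps 1 and 4; thread b skips steps 2 and 3.
        steps = [k for k in (1, 2, 3, 4)
                 if not (i == 1 and k in (1, 4)) and not (i == b and k in (2, 3))]
        chain = [("pre", i)] + [(i, k) for k in steps] + [("post", i)]
        for u, v in zip(chain, chain[1:]):
            link(u, v)
    for u, v in barrier_coordination_edges(b):
        link(u, v)
    return graph


def reachable(graph, src, dst):
    """True if there is a non-empty path from src to dst."""
    stack, seen = list(graph.get(src, ())), set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False


graph4 = barrier_graph(4)
all_ordered = all(reachable(graph4, ("pre", i), ("post", j))
                  for i in range(1, 5) for j in range(1, 5))
```

The upward chain of step 1 and step 2 edges gathers the ordering from all threads into the last thread, and the downward chain of step 3 and step 4 edges distributes it back, which is why a linear number of edges suffices.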

To illustrate this representation, Figure 2.4 shows the POEG that results when a 4-way
barrier is executed by blocks b1, b2, b3 and b4. As a result of the barrier operation,
all blocks in phase 1 (b1, b2, b3 and b4) are ancestors of all blocks in phase 2 (b5, b6, b7 and
b8).

    doall i := 1 to 4
        barrier(b)
    endall

Figure 2.4: POEG with Barrier Coordination

2.3.2      Message  Passing  Coordination 

In message passing coordination, one block sends a message that another block receives.
Message passing is most often used in distributed memory programs. However, its semantics
is general enough so that message passing can be used to model many other coordination
primitives. For instance, signal-wait coordination corresponds to asynchronous
message passing in which all messages are null.

The coordination that results from the exchange of a message between a sender thread
tS and a receiver thread tR depends on the type of message passing. In asynchronous
message passing, the sending block does not wait for the receiving thread to actually
receive the message. The sender may proceed as soon as it buffers the message for the
receiver. However, the receiver must wait for the message to arrive. Thus, asynchronous
message passing is represented by a sender vertex in tS connected to a receiver vertex in
tR.

In synchronous message passing, the sender and receiver threads must wait for each
other before a message can be passed between them. Therefore, both threads must be
at the coordination point at the same time. This is represented in the POEG as a
sender vertex in tS connected to a receiver vertex in tR, followed by a sender vertex
in tR connected to a receiver vertex in tS.

Figure 2.5 shows the POEG representation for two programs with asynchronous and
synchronous message passing. In both programs block b1 sends a message to block b2. In
the case of asynchronous message passing, block b3 can proceed before block b2 reaches
the coordination point. Therefore, block b3 is concurrent with both blocks b2 and b4.
Block b4 must wait for block b1 to send the message, and hence it is a descendant of b1.
For synchronous message passing, both blocks b3 and b4 can proceed only after both b1
and b2 have reached the coordination point. Therefore, b1 and b2 are ancestors of both b3
and b4.

    parallel case
        send(2)
    parallel
        receive(1)
    end parallel

Figure 2.5: POEG with (a) Asynchronous and (b) Synchronous Message Passing
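
The difference between the two representations can be made concrete with a block-level sketch. It follows the block roles given in the text (b1 sends, b2 receives, b3 follows b1, b4 follows b2); the function names and the flattening of sender and receiver vertices into block-to-block orderings are assumptions made for illustration.

```python
def message_poeg(synchronous):
    """Block-level ordering for one message from b1 to b2."""
    edges = {
        "b1": ["b3", "b4"],  # block edge b1 -> b3, coordination edge from the send to the receive
        "b2": ["b4"],        # block edge b2 -> b4
    }
    if synchronous:
        # Second coordination edge, from the receiver back to the sender.
        edges["b2"].append("b3")
    return edges


def is_ancestor(edges, a, b):
    """True if there is a non-empty path from a to b."""
    stack, seen = list(edges.get(a, ())), set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False


def concurrent(edges, a, b):
    return a != b and not is_ancestor(edges, a, b) and not is_ancestor(edges, b, a)


asynchronous = message_poeg(synchronous=False)
synchronous = message_poeg(synchronous=True)
```

In the asynchronous graph b3 is concurrent with both b2 and b4, whereas in the synchronous graph b1 and b2 are ancestors of both b3 and b4.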

2.3.3      Doacross  Coordination

Doacross coordination is a structured form of asynchronous message passing used in doall
constructs [Cyt84]. In doacross coordination, a thread $t_i$ must wait at some point w in its
execution until another thread $t_{i-d}$ reaches a point s in its execution, where d is a fixed
iterate distance. This is modeled in a POEG as thread $t_{i-d}$ asynchronously sending a
message to thread $t_i$.

The use of doacross coordination leads to a pipelined style of execution. Figure 2.6
shows a program with a doacross with distance 2. Each iterate $t_i$ receives a message from
iterate $t_{i-2}$ and then uses the entry of A that was computed by that iterate. As a result
of the doacross coordination, blocks b2 and b4 are indirect ancestors of block b12 (blocks
b0 and b6 are direct ancestors of block b12). Block b12 is not ordered with any block in
iterates 1, 3 and 5.

    doall i := 1 to 6
        receive(i)
        A[i] := f(A[i - 2])
        send(i + 2)
    endall

Figure 2.6: POEG with Doacross Coordination
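
The pipelined ordering produced by a doacross can be sketched by generating one asynchronous coordination edge per iterate pair at distance d. This is an illustration only: the three-block split of each iterate (`pre`, `body`, `post`), the assumption that iterates with no sender do not block, and all names are hypothetical and do not follow the block numbering of Figure 2.6.

```python
def doacross_edges(n, d):
    """Block-level edges for n iterates with doacross distance d.

    Iterate i is split at its receive and send into the blocks
    ('pre', i) -> ('body', i) -> ('post', i).
    """
    edges = {}
    for i in range(1, n + 1):
        edges.setdefault(("pre", i), []).append((("body", i), "block"))
        edges.setdefault(("body", i), []).append((("post", i), "block"))
        if i + d <= n:
            # The send at the end of body i releases the receive that
            # starts body i + d.
            edges[("body", i)].append((("body", i + d), "coord"))
    return edges


def ancestors(edges, target, kinds=("block", "coord")):
    """Blocks with a non-empty path to `target` using only the given edge kinds."""
    found, changed = set(), True
    while changed:
        changed = False
        for src, outs in edges.items():
            if src in found:
                continue
            if any(k in kinds and (dst == target or dst in found) for dst, k in outs):
                found.add(src)
                changed = True
    return found


edges = doacross_edges(6, 2)
target = ("body", 6)
all_ancestors = ancestors(edges, target)
direct = ancestors(edges, target, kinds=("block",))
indirect = all_ancestors - direct
```

For six iterates and distance 2, the body of iterate 6 has ancestors only in iterates 2, 4 and 6, and is unordered with respect to iterates 1, 3 and 5, mirroring the observation in the text.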

2.3.4     Ada  Rendezvous 

The  Ada  rendezvous  is  a  special  type  of  asynchronous  message  passing  known  as  a  remote 
procedure  call  [Sta83].  In  an  Ada  rendezvous,  a  calling  block  requests  a  called  block  to 
perform  some  remote  action.  The  calling  block  immediately  waits  for  the  called  block 
to  send  the  result  of  the  action.  After  the  called  block  completes  the  remote  action,  it 
returns the result to the calling block or indicates that the request has been completed
(if the procedure does not have any results).

An Ada rendezvous is composed of a pair of message exchanges. The first message
is sent from the calling thread to the called thread, requesting it to execute a remote
procedure. The second message is sent from the called thread to the calling thread after
the execution of the remote procedure and contains the result of the execution. These
messages are represented by two asynchronous coordination edges. Because the calling
thread begins waiting for the result immediately after it makes the request, there is
conceptually a single coordination vertex in the calling block.

Figure 2.7 shows a rendezvous between two parallel threads. Calling block b1 in task
A requests the called block b2 in task B to perform a dequeue operation. This results
in the first coordination edge in the POEG from block b1 to b2. Block b3 performs the
dequeue operation and sends the result to block b1 (via the in out parameter buf). This
creates the second coordination edge from block b3 to b1.

    Task A:
        B.dequeue(buf)

    Task B:
        accept dequeue(buf : in out buf_type) do
            process dequeue
        end dequeue

Figure 2.7: POEG with Ada Rendezvous

2.3.5      Critical  Section  Coordination 

A critical section is an instruction sequence that protects access to shared variables by
guaranteeing that all critical sections are executed sequentially [Dij65,Dij68,Dij71]. Lock
and unlock operations delimit critical sections, as illustrated by the program fragment in
Figure 2.8.

Two  properties  must  be  captured  by  the  POEG  representation  of  critical  section  co- 
ordination: 

1.  Accesses made inside critical sections never conflict.

2.  An access made inside a critical section and an access made outside a critical section
can conflict.


In addition, the unpredictable order in which parallel threads execute critical sections can
affect the result of execution. For example, in the program fragment shown in Figure 2.8,
the lock operations ensure that no two iterates update X at the same time; however, the
order in which updates occur is arbitrary and determines the final value of X.

Critical section operations are modeled in the POEG by treating the unlock operation
as a send operation and the lock operation as a receive operation. This results in a
coordination edge from the block that performed the i-th unlock operation to the block
that performs the (i+1)-st lock operation; this edge orders all accesses made in the i-th
and (i+1)-st critical section executions. Because all critical sections are ordered in the
POEG, no anomalies will be incorrectly reported among accesses performed within critical
sections. These edges reflect all interaction (and therefore all implicit ordering) among
parallel threads.

To illustrate the representation of critical section coordination, consider the program
fragment and POEG in Figure 2.8. The execution instance illustrated by the POEG in
Figure 2.8 is one in which iterates i1, i3 and i2 execute the critical section, in that order.
The coordination edge between i1 and i3 prevents the write accesses to X in the critical
section by iterates i1 and i3 from being incorrectly identified as access anomalies. Since
all implicit ordering is represented, the accesses to B are not falsely reported as access
anomalies. (Control to the entries in array B is passed through the critical section.)

Anomalies can be masked when the critical sections execute in a given order. Thus,
multiple execution instances must be analyzed to find all access anomalies. In particular,
N! POEGs are required to model the different execution orders of N concurrent critical
sections. For example, there are six possible execution orders of the critical section in the
program in Figure 2.8. In the execution instance shown, the read and write events to A[1]
are also ordered by the coordination edges added to the POEG and hence no write-read
access anomaly will be detected in this execution instance. However, these accesses to
A will be correctly identified as an access anomaly in execution instances in which i3
enters the critical section before i1. Chapter 5 addresses some of the weaknesses of this
representation.
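
The masking effect can be illustrated by enumerating all N! entry orders for the three iterates and testing, in each resulting graph, whether the write of A[1] by iterate 1 is ordered with the read of A[1] by iterate 3. The sketch is a simplified model under stated assumptions: each iterate is split into hypothetical blocks `pre` (before the lock), `cs` (the critical section) and `post` (after the unlock), and each unlock is ordered before the next lock in the chosen entry order.

```python
from itertools import permutations


def critical_section_poeg(order):
    """Block-level edges for one execution that enters the critical section in `order`."""
    edges = {}
    for i in order:
        edges[("pre", i)] = [("cs", i)]
        edges[("cs", i)] = [("post", i)]
    for earlier, later in zip(order, order[1:]):
        # The unlock in `earlier` acts as a send, the lock in `later` as a receive.
        edges[("cs", earlier)].append(("cs", later))
    return edges


def reachable(edges, src, dst):
    """True if there is a non-empty path from src to dst."""
    stack, seen = list(edges.get(src, ())), set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return False


def a1_anomaly_detected(order):
    """True if the write of A[1] (pre block of iterate 1) is concurrent
    with the read of A[1] in the post block of iterate 3."""
    edges = critical_section_poeg(order)
    write, read = ("pre", 1), ("post", 3)
    return not reachable(edges, write, read) and not reachable(edges, read, write)


orders = list(permutations((1, 2, 3)))
detecting_orders = [o for o in orders if a1_anomaly_detected(o)]
```

In this model the anomaly between iterates 1 and 3 is visible only in the entry orders where iterate 3 enters before iterate 1, which is why a single execution instance is not sufficient.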

    X := 0
    doall i := 1, 3
        A[i] := ...
        B[i] := ...
        lock(L)
        j := X
        X := i
        unlock(L)
        ... := A[1]
        ... := B[j]
    endall

Figure 2.8: POEG with Three Thread Critical Section Coordination

2.3.6      Event  Coordination

Events are a type of coordination used in parallel Fortran programs for interthread signaling.
An event e has two possible states, posted and cleared, and three operations can
be performed on an event e:

•  post(e) - The state of e becomes posted.

•  clear(e) - The state of e becomes cleared.

•  wait(e) - If the state of e is posted then the calling block proceeds. Otherwise it is
suspended until the state of e becomes posted, at which point all waiting threads
may proceed.

In contrast to message passing, no data is conveyed by an event. Therefore, events are
used primarily for conveying condition statuses. Events have been implemented for several
parallel Fortran systems such as Cedar Fortran [GPJL88].

The primary difficulty in representing event coordination stems from the fact that
there is not a one-to-one correspondence between post and wait operations. Several post
operations can precede a given wait operation. Emrath, Ghosh and Padua present a
method for representing event synchronization on a task graph (roughly equivalent to a
POEG) that is generated from an execution trace during post-mortem analysis [EGP89].
Each wait operation w has a trigger set that consists of the post operations that can
possibly signal w. Post operations performed by block p cannot trigger a wait operation
performed by block w if any of the following conditions hold:

1.  Block  p  is  a  descendant  of  w. 

2. Block p has a descendant block b such that b performs a clear or post operation and
b is an ancestor of w.

The indirect parents of a wait operation w are those blocks that are ancestors of every
block in the trigger set of w; these closest common ancestors are the blocks whose execution
is guaranteed to precede w. An iterative algorithm is given for finding the trigger set and
its closest common ancestors. A similar technique is used in static analysis of parallel
programs [CS88,CKS90].
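
The two exclusion conditions and the closest-common-ancestor step can be sketched as follows. This is only an illustration, not the iterative algorithm of [EGP89]: the graph encoding (`parents`, mapping each block to its list of parents), the `ops` table, and the function names are all assumptions made here, and a block is treated as its own ancestor when intersecting so that a single-post trigger set yields that post's block. The sample graph is a hypothetical one chosen to match the relationships between b0, b1, b2, b3, b4, b5 and b7 described for Figure 2.9.

```python
def ancestors(parents, block):
    """Proper ancestors of `block` in a DAG given as {child: [parents]}."""
    seen, stack = set(), list(parents[block])
    while stack:
        b = stack.pop()
        if b not in seen:
            seen.add(b)
            stack.extend(parents[b])
    return seen

def trigger_set(parents, ops, w):
    """Blocks whose post may signal the wait performed by block w."""
    anc = {b: ancestors(parents, b) for b in parents}
    result = set()
    for p, op in ops.items():
        if op != "post":
            continue
        if w in anc[p]:                       # condition 1: p is a descendant of w
            continue
        if any(p in anc[b] and ops.get(b) in ("post", "clear")
               for b in anc[w]):              # condition 2: a later post/clear precedes w
            continue
        result.add(p)
    return result

def closest_common_ancestors(parents, blocks):
    """Common ancestors of `blocks` that have no closer common ancestor."""
    if not blocks:
        return set()
    anc = {b: ancestors(parents, b) | {b} for b in parents}   # reflexive
    common = set.intersection(*(anc[b] for b in blocks))
    return {c for c in common
            if not any(c != d and c in anc[d] for d in common)}

# Hypothetical POEG: b0 -> b1 -> b2 -> b7 and b0 -> b3 -> {b4, b5}.
parents = {"b0": [], "b1": ["b0"], "b2": ["b1"], "b3": ["b0"],
           "b4": ["b3"], "b5": ["b3"], "b7": ["b2"]}
ops = {"b0": "post", "b1": "clear", "b2": "wait",
       "b4": "post", "b5": "post", "b7": "post"}

ts = trigger_set(parents, ops, "b2")
cca = closest_common_ancestors(parents, ts)
print(sorted(ts), sorted(cca))
```

On this assumed graph the trigger set of the wait in b2 is {b4, b5} and its only closest common ancestor is b3, so a coordination edge would be added from b3 to b2.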

To illustrate event coordination and trigger sets, consider Figure 2.9, which shows a
POEG containing various post, clear and wait operations. The trigger set for the wait
operation performed by block b2 is {b4, b5}; it does not contain the post operation
performed by b7 (since b7 is a descendant of b2) or b0 (since b1 performs a clear operation
and is a descendant of b0 and an ancestor of b2). The set of closest common ancestors
of the trigger set {b4, b5} is {b3}. This means that an asynchronous coordination edge is
added from block b3 to block b2 to capture the ordering resulting from the wait operation
performed by b2.

The on-the-fly computation of the trigger set and closest common ancestors* differs
from the post-mortem computation in several ways:

1. The entire POEG is not available when coordination edges are added. In particular,
when a wait operation is performed there may be post operations that should be
included in its trigger set that have not yet executed. Therefore, the set of closest
common ancestors may have to be calculated from a partial trigger set.

2. Whenever a clear operation is performed, the current trigger set is deleted. However,
a clear operation does not affect future (and possibly concurrent) post operations.

3. The on-the-fly representation is not iterative; once an ancestor set is calculated it
is not modified later during program execution.

The on-the-fly technique uses information about the actual order in which post and wait
operations are performed during an execution instance. (The post-mortem approach, in
contrast, attempts to compute a more global ordering relationship.) Because of this, the
on-the-fly computation may add more ordering than intended by the programmer, and
anomalies can be missed in a given execution instance. However, Theorem 3 proves that
safe accesses will not be incorrectly reported as anomalies.

* The set of closest common ancestors is computed from the ancestor information used in
task recycling (described in Chapter 3) by creating an intersection parent vector whose
i-th entry is the minimum of the i-th entries of the parent vectors of the blocks in the
trigger set.

Figure 2.9: POEG with Event Coordination

Theorem  3:  The  on-the-fly  representation  for  event  coordination  guarantees 
that  two  blocks  that  cannot  execute  concurrently  will  be  ordered  in  the  POEG. 

This  property  does  not  hold  for  the  post-mortem  algorithm. 

To illustrate the difference between the two techniques, consider the POEG in Figure
2.10. In the on-the-fly technique, there are several possible trigger sets of the wait
operations performed by blocks b1 and b2, depending on the actual execution order of the
post and wait operations:

Figure 2.10: POEG with Event Coordination

1. If b1 completes before b2 then T_b1 = {b6} and an edge is added from b6 to b1.
There are several different trigger sets for b2:

(a) T_b2 = {b3}: An edge is added from b3 to b2.

(b) T_b2 = {b3, b4}: An edge is added from b3 to b2. (The edge from b6
to b1 has already been added, making b3 an ancestor of b4.)

2. If b2 completes before b1 then T_b2 = {b3} and an edge is added from b3 to b2.
There are several different trigger sets for b1:

(a) T_b1 = {b6}: An edge is added from b6 to b1.

(b) T_b1 = {b5}: An edge is added from b5 to b1.

(c) T_b1 = {b5, b6}: An edge is added from b3 to b1. (The edge from b3
to b2 has already been added, making b3 an ancestor of b5.)

The three possible POEGs are shown in Figure 2.11. In all cases the read of X by b4 and
the write of X by b3 are ordered, and hence no anomaly will be reported.

In contrast, the post-mortem algorithm adds a (redundant) coordination edge from
block b0 to block b2 for the wait operation performed by block b2, since b0 is the closest
common ancestor of blocks {b3, b4}. Likewise, the wait operation performed by block b1
results in an edge from block b0 to block b1. (Block b0 is also the closest common ancestor
of {b5, b6}.) Therefore, the write to X performed by b3 and the read of X by b4 will be
incorrectly identified as an anomaly.

Theorem  3   The  on-the-fly  representation  for  event  coordination  guarantees  that  two 
blocks  that  cannot  execute  concurrently  will  be  ordered  in  the  POEG. 


Figure 2.11: Three possible POEGs with Coordination Edges (panels: 1a, 1b and 2a; 2b; 2c)

Proof. Suppose, for the sake of contradiction, that two blocks are incorrectly modeled as
being concurrent in the on-the-fly representation of some execution instance E. Let a and
w be the first blocks such that a must execute before wait operation w in all execution
instances, but a and w are concurrent in E, and let T_w denote the trigger set of w.

Since w was eventually posted in E, we know that T_w ≠ ∅. Block a must be concurrent
with or a descendant of at least one block in T_w: if a were an ancestor of all blocks in T_w,
a would be a common ancestor of T_w and therefore be ordered with w in E. Moreover, a
must not be an ancestor of w.

Let E' be an execution instance that is identical to E except that the trigger set of w
in E' consists of T_w − T_a, where T_a contains all members of T_w that are descendants of a.
E' is a valid execution instance since no block in T_a is an ancestor of w, a or any member
of T_w − T_a. (If a block in T_a were an ancestor of w then a would also be an ancestor of
w.) Block a is either concurrent with or a descendant of every block that could possibly
have triggered w in E'. Therefore, w can execute at the same time as a in E' and hence
a contradiction has been reached. □

2.4      Reliability of Dynamic Access Anomaly Detection

The reliable detection of access anomalies is based on two properties:

Reliability Properties:

1. A program with no anomalies is not reported falsely as having anomalies.

2. A program with anomalies is reported as having anomalies.

It is not difficult to guarantee Reliability Property 1. To do so, a POEG representation
must guarantee that two accesses that are ordered in every execution instance are
not incorrectly reported as an access anomaly. All representations for the coordination
primitives described in Section 2.3 meet Reliability Property 1. This can be seen easily
for barrier, message passing, doacross, rendezvous, and critical section coordination;
Theorem 3 proves it explicitly for event coordination.
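
To make Reliability Property 1 concrete, the sketch below checks a recorded list of accesses against a POEG and reports a pair only when the two blocks are unordered and at least one access is a write. It is a brute-force illustration under assumptions made here (the `parents` encoding, the `(block, variable, kind)` access tuples, and the function names are hypothetical); it is not the task recycling technique of Chapter 3.

```python
def ancestors(parents, block):
    """Proper ancestors of `block` in a DAG given as {child: [parents]}."""
    seen, stack = set(), list(parents[block])
    while stack:
        b = stack.pop()
        if b not in seen:
            seen.add(b)
            stack.extend(parents[b])
    return seen

def ordered(parents, a, b):
    """Two blocks are ordered if they are equal or one is an ancestor of the other."""
    return a == b or a in ancestors(parents, b) or b in ancestors(parents, a)

def find_anomalies(parents, accesses):
    """accesses: list of (block, variable, kind) with kind 'r' or 'w'."""
    found = []
    for i, (b1, v1, k1) in enumerate(accesses):
        for b2, v2, k2 in accesses[i + 1:]:
            if v1 == v2 and "w" in (k1, k2) and not ordered(parents, b1, b2):
                found.append((v1, b1, b2))
    return found

accesses = [("a", "X", "w"), ("b", "X", "r"), ("join", "X", "r")]

# Plain fork-join: blocks a and b are concurrent.
fork_join = {"root": [], "a": ["root"], "b": ["root"], "join": ["a", "b"]}
# Same program with a coordination edge from a to b.
coordinated = {"root": [], "a": ["root"], "b": ["root", "a"], "join": ["a", "b"]}

unordered_report = find_anomalies(fork_join, accesses)
ordered_report = find_anomalies(coordinated, accesses)
print(unordered_report, ordered_report)
```

Once the coordination edge orders a before b, the write and the read of X are no longer reported, which is the behavior Property 1 requires of every representation in Section 2.3.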

However, guaranteeing Reliability Property 2 is more difficult. The primary weakness
of the dynamic access anomaly detection approach is that it only conveys information
about a single execution instance of a program; therefore, it is a priori limited to proving
properties about a given input vector*. The problem of detecting access anomalies in a
global sense for a program can be addressed only by static analysis techniques.

Even when the input vector is fixed, the absence of access anomalies in one execution
instance of a program does not necessarily indicate that the program will never have an
access anomaly. We say that a program and input vector pair (P, I) is anomaly free if and
only if no execution instance of P on I contains an access anomaly. If we are unable to
prove that a program and input vector pair is anomaly free, we can have little confidence
in the correctness of the program. The difficulty of guaranteeing freedom from anomalies
lies in the number of execution instances that must be analyzed.

Once an input vector is defined, there are only two sources of nondeterminism in
a parallel program. One source of nondeterminism stems from access anomalies, which
can modify the subsequent execution of the program. Since our goal is to guarantee
that the program and input vector pair is anomaly free, detecting the first occurrence
of an anomaly is sufficient. The second source of nondeterminism stems from certain
coordination operations; for example, critical section coordination. Thus, even in the
absence of access anomalies there can be many different execution instances for a given
program and input vector pair. While programs would be easier to write and debug if
this nondeterminism were outlawed, nondeterminism is in fact useful for a wide variety
of applications and thus necessary.

We distinguish between three types of nondeterministic behavior that can be found in
parallel programs: internal and external nondeterminism (defined by Emrath and Padua
[EP88]) and POEG nondeterminism.

Definitions:

Internal Determinism: The sequence of instructions performed by each thread, as
well as the variables used by those instructions, is deterministic.

External Determinism: The output of the program is deterministic, but the program
is internally nondeterministic.

POEG Determinism: The structure of the POEG is deterministic.

Most parallel programs are externally deterministic.

* An input vector contains all external input to the program including, for example,
values returned from a random number generator, timing routines, or interrupts.

There is a subtle relationship between internal nondeterminism and POEG nondeterminism.
By definition and construction, all nondeterminism resulting from coordination
operations is explicitly modeled in the POEG: the only difference in two execution
instances of a program and input vector pair that map to the same POEG derives from
access anomalies. In the absence of access anomalies, each individual block in a POEG is
internally deterministic. If the nondeterminism arising from varying interthread
coordination were not represented in the POEG, then the program would be POEG deterministic,
but each block could be internally nondeterministic.

In software testing theory [GG77,WO80], a set of inputs T is chosen to satisfy a data
selection criterion C, where C denotes some predicate over subsets of the domain D of all
possible inputs. A selection criterion is reliable if and only if the program succeeds or fails
consistently when executing a test input that satisfies C. A selection criterion is valid if
and only if there exists a set of inputs satisfying C that will not be executed correctly if
the program is incorrect. A set of test inputs T is successful if the program succeeds on
each member of T. The fundamental theorem of testing is based on these definitions:

\[
(\exists C)\,\bigl(\mathit{valid}(C) \wedge \mathit{reliable}(C) \wedge (\exists T \subseteq D)\,(C(T) \wedge \mathit{successful}(T))\bigr) \supset \mathit{successful}(D)
\]

In other words, a successfully executed test suite is equivalent to a direct proof of
correctness if the test suite satisfies a data selection criterion proved to be both valid and
reliable.

In the context of dynamic access anomaly detection, the domain of test inputs is the
entire collection of possible execution instances for a given program and input vector pair
(since we are interested in proving properties about a given input vector). The first result
about the complexity of guaranteeing Reliability Property 2 is Theorem 4.

Theorem 4: A valid and reliable data selection criterion for guaranteeing that
a program and input vector pair (P, I) is anomaly free is a set of execution
instances whose POEGs cover the entire range of graphs possible for P and I.

It follows from Theorem 4 that the number of execution instances that must be analyzed
to guarantee that a program and input vector pair (P, I) is anomaly free is bounded by
the number of different POEGs for P and I. Given some tool for generating execution
instances, theoretically we can prove that a program is anomaly free for a given input
vector.

Unfortunately, the number of graphs that exist for POEG nondeterministic programs
makes even this task intractable. Let Ni denote the number of different execution orders
of the coordination operations for each initial program state of a set of concurrent threads
Ci. (The program state at any point in the program execution is the current value of
all variables that are later referenced.) Each execution order of Ci can generate a new
program state at the beginning of the subsequent execution. Thus, a program that consists
of two parallel constructs, C1 followed by C2, has N1 × N2 possible POEGs. In general,
the number of different graphs for a program with POEG nondeterminism is a product
over the possible sources of nondeterminism.
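
A small worked example shows how quickly this product grows. The sketch assumes, purely for illustration, that each parallel construct consists of k threads that each enter one critical section exactly once and that every entry order yields a distinct POEG, so that Ni = k! for that construct; real programs may have fewer distinct graphs. The function names are hypothetical.

```python
from itertools import permutations
from math import factorial, prod

def critical_section_orders(threads):
    """All orders in which the given threads can enter one critical section."""
    return list(permutations(threads))

def poeg_count(construct_sizes):
    """Product of Ni over a sequence of constructs, assuming Ni = k! for k threads."""
    return prod(factorial(k) for k in construct_sizes)

orders = critical_section_orders(["t1", "t2", "t3"])
print(len(orders))            # N1 for one three-thread construct
print(poeg_count([3, 3]))     # C1 followed by C2: N1 x N2
print(poeg_count([3] * 5))    # five such constructs in sequence
```

Under these assumptions a single three-thread construct has 6 orders, two in sequence give 36 graphs, and five give 7776, so exhaustively covering every POEG becomes impractical even for short programs.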

Fortunately, not all programs are POEG nondeterministic. A direct consequence of
Theorem 4 is Corollary 1.

Corollary 1: One execution instance is sufficient to guarantee that a POEG
deterministic program and input vector pair is anomaly free.

To apply Corollary 1, one must be able to classify programs as POEG deterministic or
POEG nondeterministic.

POEG nondeterminism arises when race conditions in the execution order of coordination
operations affect the structure of the POEG. For example, the order in which a
set  of  threads  reach  a  barrier  does  not  affect  the  representation  in  the  POEG.  However, 
the  actual  order  in  which  two  threads  execute  a  critical  section  is  explicitly  represented 
in  the  POEG.  Therefore,  a  simple  way  to  detect  if  a  program  is  POEG  nondeterministic 
is  to  consider  the  types  of  coordination  primitives  that  it  contains. 

We  can  distinguish  between  two  types  of  coordination  primitives: 

Definition: 

A coordination primitive is matched if and only if the edges required to represent the
coordination between bi and bj in any POEG are identical in all execution instances
in which the initial program states of bi and bj are identical.

Otherwise,  the  coordination  primitive  is  unmatched.  The  concept  of  matched  coordination 
provides  a  method  for  identifying  POEG  deterministic  programs. 

Theorem 5: Every program whose only parallel operations are fork and join
operations and matched coordination is POEG deterministic.

Theorem 6: The barrier coordination primitive, message passing coordination
with explicitly named senders and receivers, and doacross coordination
primitive are matched.

Therefore, parallel programs that only contain certain parallel operations (namely, fork
and join operations and barriers, named message passing, and doacross coordination) can
be guaranteed to be anomaly free for a given input vector simply by examining a single
execution instance. This subclass is important since it includes many parallel scientific
programs.

Many common types of coordination, including Ada rendezvous, critical section
coordination and events, are unmatched. In general, any type of guarded or unnamed
coordination is unmatched. (If a coordination operation is guarded, then whether or not it
is performed depends on the state of other threads [Dij75]; coordination is unnamed if
the pair of coordinating threads is not predefined.) However, not all programs that
contain unmatched coordination are POEG nondeterministic. For example, the execution of
many programs with critical section coordination is independent of the execution order
of the critical sections. Chapter 5 considers the specific types of nondeterminism that
can arise from critical section coordination and discusses techniques for determining when
nondeterminism is present and how it affects the detection of access anomalies.
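
The classification suggested by Theorems 5 and 6 amounts to a simple check on the kinds of parallel operations a program uses. The sketch below is one hedged way to express it; the primitive names in the two sets are labels invented here, and the "unknown" outcome reflects the observation above that unmatched coordination does not by itself imply POEG nondeterminism.

```python
# Hypothetical labels for the parallel operations discussed in this section.
MATCHED = {"fork", "join", "barrier", "named_message_passing", "doacross"}
UNMATCHED = {"rendezvous", "critical_section", "event"}

def classify(primitives):
    """Classify a program by the set of parallel operations it contains."""
    used = set(primitives)
    unrecognized = used - MATCHED - UNMATCHED
    if unrecognized:
        raise ValueError(f"unrecognized primitives: {sorted(unrecognized)}")
    if used <= MATCHED:
        # Theorems 5 and 6: one execution instance suffices (Corollary 1).
        return "POEG deterministic"
    # Unmatched coordination is present; further analysis is needed.
    return "unknown"

scientific = classify(["fork", "barrier", "doacross", "join"])
with_locks = classify(["fork", "critical_section", "join"])
print(scientific, with_locks)
```

A program in the first class can be checked with a single monitored run per input vector, while one in the second class needs the kind of analysis discussed in Chapter 5.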

Identifying a program as POEG deterministic simplifies the analysis of the program
in several additional ways:

1. Once a POEG deterministic program has been proven to be anomaly free, the
complexity of showing other properties of the program is greatly reduced. For instance,
since the structure of the POEG is deterministic, a single execution instance is
sufficient to guarantee that the program and input vector pair will never deadlock.

2. There is no probe effect when detecting access anomalies for POEG deterministic
programs. The structure of the POEG, and hence the concurrency relationship, is
unaffected by changes in the program behavior due to the insertion of monitoring
code.

3. POEG determinism simplifies the task of trace-and-replay debugging of parallel
programs (for an example see the work of Fowler, LeBlanc and Mellor-Crummey
[LMC87,FLMC88]). In particular, matched coordination operations do not need to
be traced since the same set of coordination operations will occur in every execution
instance. If a program is POEG deterministic then no coordination operations need
to be traced or replayed.

4. Testing POEG deterministic programs with no access anomalies is relatively simple.
The formal model for testing parallel programs proposed by Weiss [Wei88] is based
on sequentializations of a parallel program. In an anomaly free program, the POEG
captures all interblock interaction. Thus, one sequentialization is sufficient to test
an anomaly free POEG deterministic parallel program for a given input vector.

This gives one hope that dynamic anomaly detection can be combined with testing
methodologies for generating input vectors and other debugging techniques to guarantee
that a POEG deterministic program is correct as well as anomaly free.

Lemma 1  In the absence of access anomalies, the initial program states of all blocks are
identical in all execution instances of a program and input vector pair that are represented
by the same POEG.

Proof. Proof by induction on the length L of the longest path from the root of the POEG
to a block b.

Base Case (L = 0). The program state of the root block is deterministic since it is the
initial input vector.

Induction Hypothesis: Assume that the lemma holds for all blocks with L < i. Prove
that for all blocks b with L = i the initial program state of b is identical in all execution
instances Ei that are represented by the same POEG.

Since all execution instances Ei are represented by the same POEG, b must have the
same set of parent blocks in all Ei. The initial program states of all parent blocks of b are
identical in all Ei by the induction hypothesis (the longest path from the root to a parent
of b is guaranteed to be less than i). Since there are no access anomalies, every block
performs the same sequence of actions in every execution instance and no shared variable
is written by two parent blocks. Therefore, the final program state of each parent block
is identical and non-conflicting in all execution instances. The initial program state of b
is simply a union of the final program states of its parent blocks and hence must be the
same in all execution instances. □

Theorem  4  A  valid  and  reliable  data  selection  criterion  for  guaranteeing  that  a  program 
and input vector pair (P, I) is anomaly free is a set of execution instances whose POEGs
cover  the  entire  range  of  graphs  possible  for  P  and  I. 

Proof. Suppose, for the sake of contradiction, that some execution instance E has an
access anomaly that is not detected. E must have the same POEG as some execution
instance E' that does not have any access anomalies, since all possible POEGs for the
program and input vector pair were analyzed.

In order for there to be an anomaly in E and not in E', the set of variables read
and written by some block must be different in E and E'. Consider the set of concurrent
blocks B that contain access anomalies in E such that no ancestor of a block in B contains
access anomalies. By Lemma 1, we know the initial program states of the blocks in B are
the same in E and E'. Since all sources of nondeterminism are fixed for all blocks b ∈ B,
the actions performed by each b are identical before the first anomaly.

We can create a partial order of the accesses in B by linearizing: (i) the accesses in
each block, and (ii) the accesses to each variable. Let a be an access in E that has a
different value in E' such that all accesses ordered before a are identical in E and E'. If
a is a definition, it is a function of a set of uses. Since these uses are ordered before a
by rule (i), they must have the same values in both execution instances. Thus, the value
computed for a also must be the same. If a is a use, it is reached by some definition, d.
Since d is ordered before a by rule (ii), d must be identical in both execution instances.
Thus, the value that reaches a must also be the same. Hence, every access must be
identical in E and E' and therefore an anomaly is detected in E if and only if an anomaly
is detected in E'. □

Corollary 1  One execution instance is sufficient to guarantee that a POEG deterministic
program and input vector pair is anomaly free.

Proof. Follows directly from the definition of POEG determinism and Theorem 4. □

Theorem  5   Every  program  whose  only  parallel  operations  are  fork  and  join  operations 
and  matched  coordination  is  POEG  deterministic. 

Proof. Proof by induction on the length L of the longest path from the root to the block.

Base Case (L = 0). The execution of the root block is deterministic and anomaly free.

Induction Hypothesis: Assume that the theorem holds for all blocks with L < i. Prove
that for blocks b with L = i the structure of the POEG is deterministic up to and
including the creation of b.

By the induction hypothesis, the structure of the POEG up to and including the
creation of the parent blocks of b is deterministic. Therefore, by Theorem 4 the initial
program state of the parent blocks of b, and hence the actions performed by the parent
blocks, are deterministic. Consider all possible ways in which b could be created:

• Fork: In the absence of access anomalies, each fork operation is a simple instruction
whose execution is independent of the actions of concurrent threads. Thus, the
parent of b creates the same children blocks in all execution instances.

• Join: Each join operation is a simple instruction based on a preceding fork operation.
Thus, the parents of b all perform the same join operation in every execution
instance.

• Matched Coordination: By Lemma 1 the initial program states of the parent blocks
of b must be identical in every execution instance. Since all coordination is matched,
by the definition of matched coordination, the execution of the parents of b results in
the same set of coordination edges in all execution instances.

In all cases the structure of the POEG is deterministic. □

Theorem 6  The barrier coordination primitive, message passing coordination with
explicitly named senders and receivers, and doacross coordination primitive are matched.

Proof. We will consider each coordination primitive separately.

Barrier: By definition, it is semantically incorrect for either only n - 1 or n + 1
concurrent threads to perform a given n-way barrier operation*. Since the execution
of the program up to the point of the barrier is deterministic, the only way to enforce
this correctness constraint is if the set of threads executing a barrier operation is
fixed a priori.

Message Passing: In the absence of access anomalies and preceding POEG
nondeterminism, the execution of each block is deterministic and hence each block always
executes the same series of named send and receive operations. In particular, the
receiver will always receive the messages in the same order regardless of the order
they were actually sent. Therefore, the same set of coordinations will occur in every
execution instance of a given program and input vector pair.

Doacross: This follows directly from the above argument since doacross coordination
is a restricted form of named message passing.

In all cases the structure of the POEG is deterministic and hence the coordination
primitives are matched. □

* This can be detected easily during anomaly detection.

Chapter 3

Task  Recycling  Technique 


This chapter presents the task recycling technique for detecting access anomalies during
program execution. Task recycling improves upon existing dynamic access anomaly
detection techniques in three significant ways:

1. Generality. The task recycling technique is applicable to a wide variety of thread
creation and termination primitives and all common synchronous and asynchronous
coordination primitives. This is an improvement over the methods of Hood, Kennedy
and Mellor-Crummey [HKMC89], Choi, Miller and Netzer [Cho89,CMN88,MC88]
and Nudler and Rudolph [NR88a,NR88b], which only support a subset of the common
coordination and thread creation primitives.

Task recycling has been used to find anomalies in programs written in a parallel
Fortran dialect by Dinning and Schonberg [DS90] as well as in Ada by Hind, Ishbiaih,
Konany and Schonberg [SS86,HIKS]. Chapter 4 presents empirical measurements
of the Fortran implementation.

2. Performance. Monitoring costs are incurred at thread create, terminate, and
coordinate operations, and every time a monitored variable is accessed. Because variable
accesses are generally the most frequent operation, the task recycling technique is
preferred, since it reduces the overhead per variable access to a small constant.
While the worst case cost at a parallel operation is linear in the maximum
concurrency of the graph, the actual cost depends on the complexity of synchronization
patterns and on the nesting structure of parallel constructs. Experiments show that
the parallel structure of a large class of scientific programs is very simple. For these
programs, the expense incurred at parallel operations can be significantly reduced
(as low as O(1)) by using standard types of data structure transformations.

3. Storage. The task recycling technique derives its name from the fact that it reuses
task identifiers. This enables task identifiers to be simple integers, so that the
storage cost for concurrency information depends only on the maximum concurrency
of the POEG. In contrast, the number of task identifiers in English-Hebrew labeling
[NR88a,NR88b] grows with the length of execution, and each identifier is a complex
object, making storage management an issue.

To  detect  anomalies,  the  task  recycling  technique  must  first  determine  which  Vciriables 
aie  accessed  by  each  block.  Section  3.1  describes  how  this  information  is  mauntained 
in  access  histories  associated  with  monitored  variables.  The  access  history  of  a  shcired 
variable  X  stores  the  identifiers  of  the  blocks  that  most  recently  accessed  X.  When  a  new 
block  accesses  X,  the  access  history  of  X  is  compaired  with  the  information  about  the 
block's  concurrency  relationship  to  check  whether  there  is  an  access  anomaly  involving 
X. 

Secondly, we must be able to determine which blocks can potentially execute
concurrently by analyzing and encoding the POEG. The ancestor relationship is stored in
parent vectors, as described in Section 3.2. A parent vector summarizes the thread
creation, termination, and coordination history in a compact manner that reduces the cost
of checking if two blocks are concurrent to a small constant.

The problem of assigning task identifier labels to blocks is discussed in Sections 3.3
and 3.4. The goal of a task assignment algorithm is to minimize the number of tasks
used in assigning blocks by recycling tasks, because the number of tasks determines the
length of parent vectors.

The techniques developed for detecting access anomalies can be integrated with other
parallel debugging tools to increase their functionality and robustness. Section 3.5
discusses extensions to two of the most common methods for debugging shared memory
parallel programs: trace-and-replay debuggers and event-based monitoring. For the
remainder of this chapter, the symbols in Table 3.1 are used to denote various parameters.


Symbol   Description
A        the number of tasks used in a task assignment
B        the number of blocks in the POEG
N        the maximum level of nesting of fork-join constructs
T        the maximum concurrency of the POEG
T'       the maximum concurrency of the POEG if coordination edges are ignored
V        the number of monitored variables
W        the number of blocks created by a fork operation

Table 3.1: Table of Symbols

3.1 Maintaining Access Histories

One of the primary drawbacks of trace-based anomaly detection is its potentially very
large storage requirements. The amount of variable access data is proportional to the
length of the program execution times the maximum concurrency of the program. In
on-the-fly detection of anomalies, information that is no longer needed can be discarded as
the program executes. In particular, task recycling compresses information about variable
accesses so that storage is bounded by O(V × T). By reducing the amount of data stored,
the work required to check each variable access is also decreased.

Although compaction of events reduces the amount of variable access data, certain
information is lost. In particular, if there are multiple anomalies involving the same
variable, some of them may not be reported. The primary requirement in on-the-fly
anomaly detection is that enough information is saved to guarantee that at least one
anomaly is reported for every variable that is accessed in an "unsafe" manner. This
level of error reporting is generally considered to be sufficient, since any execution after
the first access anomaly is suspect. A secondary goal is to maintain enough information
so that the sequence of accesses to a shared variable can be reproduced.

To evaluate Bernstein's conditions, the task recycling technique stores Read and Write
sets in an inverse format by associating information with each monitored variable. The
access history of a variable X contains information about the blocks which have read and
written X. Although we use the term "variable", access histories are actually associated
with virtual addresses. Therefore, variable aliasing, one of the primary weaknesses of
static anomaly detection, is not an issue. Access anomalies can be accurately detected
even in programs with pointers and complex array indexing.

Every time variable X is read or written, its access history is examined to determine
whether the current read or write event conflicts with a previous one. In particular, when
block b reads X, b checks whether it is concurrent with any of the writers in
the access history of X. When block b writes X, b checks whether it is concurrent with any
of the blocks in the access history of X. If either condition is true, an access anomaly is
reported. Block b then adds its block label to the appropriate set in the access history.
The algorithms for checking for conflicts at read and write events are shown in Algorithm
3.1.

procedure Check-Read(b, X)
    for all a ∈ Writer(X) do
        if IsConcurrent(b, a) then report Access Anomaly
    end for
end procedure

procedure Check-Write(b, X)
    for all a ∈ Reader(X) ∪ Writer(X) do
        if IsConcurrent(b, a) then report Access Anomaly
    end for
end procedure

Algorithm 3.1: Check for Read and Write Events

The efficiency of detecting anomalies primarily depends on: (1) how quickly we can
determine if two blocks are concurrent, and (2) how many entries are in an access
history. Therefore, a goal of on-the-fly anomaly detection is to minimize the size of access
histories. (Section 3.2 discusses an efficient concurrency check.) The notion of storing
access information with monitored variables was first proposed by Nudler and Rudolph
[NR88a,NR88b], who save only the most recent reader and the most recent writer. This
data is insufficient to guarantee that every variable that is accessed in an unsafe manner
is detected. If all information about past accesses is maintained, however, the size of the
access histories will grow to be equal to the size of the trace data. We would like to store
only the information required to determine if a variable is being accessed in an unsafe
manner and delete all redundant information.

Consider the program in Figure 3.1 and suppose that a variable X is read by blocks
$b_1$, $b_2$, and $b_3$ and written by block $b_4$, in that order. The write of X by $b_4$
conflicts with the reads of X by $b_1$ and $b_2$. After the first read, the access history of
X contains $b_1$. When X is next read by $b_2$, block $b_2$ is added to the access history
and $b_1$ is deleted. Once $b_2$ is added to the reader set, the $b_1$ read event is no longer
needed: any subsequent write event that conflicts with $b_1$ also conflicts with $b_2$. On the
other hand, when $b_3$ reads X and is added to the access history, the $b_2$ read event
cannot be deleted. Otherwise, no anomaly can be detected when X is written by $b_4$, since
this write conflicts with the read by $b_2$ but does not conflict with the read by $b_3$.

More generally, a block b in the reader (or writer) set of the access history of a variable
X can be subtracted from the access history when a descendant of b reads (or writes) X.
Thus, a set in the access history contains two blocks only if they are concurrent. In
addition, an entry in an access history can be subtracted if it conflicts with the current
access.^1 By deleting obsolete entries, the size of the reader set is bounded by the maximum
concurrency of the program, T. Since two concurrent writes always conflict, there is at
most one writer in an access history. Algorithm 3.2 shows the subtraction algorithms for
read and write events.

The general subtraction rule is that an event a is subtracted by an event e only if
every future event that conflicts with a also conflicts with e. Therefore, every variable X
that is accessed in an unsafe manner is guaranteed to be detected. In addition, enough
access data is maintained to achieve the second goal of on-the-fly detection: specifically,
to provide enough information about the execution order of accesses so that the sequence
of accesses to X can be reproduced. (This is proved in Section 3.5.)

^1 In order to maintain enough information to attain the secondary goal of reproducibility,
a reader does not delete a concurrent writer from an access history. This deletion can be
performed if the only goal is to detect at least one anomaly for every variable that is
accessed in an unsafe manner.

Figure 3.1: Access History Example

The bound of T on the size of the reader set is much too pessimistic. Chapter 4 shows
that for typical numeric parallel Fortran programs, the average access history size is very
small and independent of T. This is due to the common practice of data partitioning
to optimize performance. For many programs a reader set with two entries is sufficient
for most variables, and thus the total space for access histories is O(V). In fact, for
programs with no nested parallelism or coordination, a reader set of size two is always
sufficient to detect all variables that are accessed in an unsafe manner. However, not
enough information is then maintained to reproduce a sequence of accesses to a variable.

procedure Subtract-Read(b, X)
    for all a ∈ Reader(X) do
        if not IsConcurrent(b, a) then
            Reader(X) := Reader(X) - a
        end if
    end for
    Reader(X) := Reader(X) + b
end procedure

procedure Subtract-Write(b, X)
    for all a ∈ Reader(X) do
        if not IsConcurrent(b, a) then
            Reader(X) := Reader(X) - a
        end if
    end for
    Writer(X) := b
end procedure

Algorithm 3.2: Subtraction for Read and Write Events
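
The check and subtraction steps of Algorithms 3.1 and 3.2 can be sketched together in Python as follows. This is a minimal illustration, not the thesis implementation: the names `AccessHistory` and `Monitor` are hypothetical, and an explicit table of ancestor sets stands in for the parent-vector check of Section 3.2. The ordering used in the example (b1 precedes b2, b3 precedes b4, everything else concurrent) is an assumption chosen to match the access history example in the text, since the figure itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AccessHistory:
    """Reader set and single writer kept for one monitored variable."""
    readers: set = field(default_factory=set)
    writer: Optional[str] = None


class Monitor:
    def __init__(self, ancestors):
        # ancestors[b] is the set of all ancestors of block b. This explicit
        # table is an assumption standing in for the parent-vector check.
        self.ancestors = ancestors

    def is_concurrent(self, b, a):
        # A recorded block that is neither b itself nor an ancestor of the
        # executing block b must be concurrent with b.
        return a != b and a not in self.ancestors[b]

    def read(self, b, hist):
        # Check-Read: only the recorded writer can conflict with a read.
        anomalies = []
        if hist.writer is not None and self.is_concurrent(b, hist.writer):
            anomalies.append((hist.writer, b))
        # Subtract-Read: drop readers that are ancestors of b, then add b.
        hist.readers = {a for a in hist.readers if self.is_concurrent(b, a)}
        hist.readers.add(b)
        return anomalies

    def write(self, b, hist):
        # Check-Write: recorded readers and the writer can all conflict.
        recorded = set(hist.readers)
        if hist.writer is not None:
            recorded.add(hist.writer)
        anomalies = sorted((a, b) for a in recorded if self.is_concurrent(b, a))
        # Subtract-Write: drop ancestor readers and record b as the writer.
        hist.readers = {a for a in hist.readers if self.is_concurrent(b, a)}
        hist.writer = b
        return anomalies


# Assumed ordering for the example: b1 precedes b2, and b3 precedes b4.
monitor = Monitor({"b1": set(), "b2": {"b1"}, "b3": set(), "b4": {"b3"}})
x = AccessHistory()
for block in ("b1", "b2", "b3"):
    monitor.read(block, x)
readers_before_write = set(x.readers)  # b1 has been subtracted by b2
reported = monitor.write("b4", x)      # the write conflicts with the read by b2
```

Under these assumptions the reader set holds b2 and b3 when b4 writes, so the conflict with b2 is still visible even though b1 has been discarded.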

3.2 Detecting Concurrency

Concurrency information is used at every variable access and updated at every parallel
operation. Because we expect accesses to monitored variables to be more frequent than
parallel operations, the task recycling technique is designed to make the cost of monitoring
a variable access as small as possible. To do so, the data structure for maintaining ancestor
information minimizes the cost of concurrency checking. Only a small constant amount
of work is required to determine if two blocks are concurrent, which matters because this
check must be performed for every entry in the access history of variable X every time
that X is accessed. However, the cost at fork, join, and coordination operations can be
relatively high: in the worst case it is proportional to the maximum concurrency of the
POEG, T.

In order to bound the space required for maintaining concurrency information, the task
recycling algorithm never explicitly builds or analyzes an entire POEG during execution.
Instead, the concurrency relationship inherent in the structure of the POEG is computed
incrementally as the program executes and stored in a compact format. Because we are
monitoring during program execution, a block whose tag is in an access history and that
is not an ancestor of an executing block b must be concurrent with b. Thus, only
information about the ancestor relationship is maintained.

The ancestor relationship is encoded by a unique task identifier label $t_v$ associated
with each block, consisting of a task t and a version number v. More than one block
may be assigned to the same task at different times in the program execution. The rule
for recycling tasks is that once a block completes, its task can be reassigned to any of
its descendant blocks. Therefore, two blocks that are successively assigned the same task
must be ordered by the ancestor relationship. The version number of a task identifier is
used to distinguish among different blocks assigned to the same task; every time a block
is assigned to a task t, the associated version number v is incremented. A block with task
identifier $t_i$ is an ancestor of a block with task identifier $t_j$ if and only if i < j.

Figure 3.2 shows a valid task assignment of a partial POEG. This assignment is valid
since no task is assigned to concurrent blocks. For example, task 3 is assigned to two
blocks that are ordered by a coordination edge. When task 3 was reassigned, its version
number was incremented.

Figure 3.2: Valid Task Assignment

Task identifiers contain very little information about the ancestor relationship.
Additional concurrency information is maintained in a parent vector associated with each
currently executing block. Data structures similar to parent vectors have been used
before. For example, parent vectors correspond to the time vectors used by Fidge [Fid88],
and the before vectors in the post-mortem analysis technique of Choi, Miller and Netzer
[Cho89,CMN88,MC88].^2

There is an entry in the parent vector associated with each task t that contains
information about the ancestors that were assigned task t. Specifically, the t-th entry in the
parent vector for block b stores the version number of the ancestor of b that was most
recently assigned task t (i.e., the ancestor with the largest version number). An executing
block b can tell if it is concurrent with a block assigned task identifier $t_v$ by comparing v
to the t-th entry in the parent vector of b. Algorithm 3.3 shows the code for determining
if block b is concurrent with task identifier $t_v$.

^2 Because we monitor on-line, only a limited number of parent vectors exist at any point
in the execution.

procedure IsConcurrent(b, t_v)
    if parent(b)[t] < v then
        return true
    else return false
end procedure

Algorithm 3.3: Check Concurrent
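
As a small illustration of Algorithm 3.3, the check reduces to one array index and one comparison. The sketch below is a hypothetical Python rendering in which a parent vector is a plain list and tasks are numbered from 1 as in the text, so entry t lives at index t - 1. The vector used is the one the text gives for block $1_5$ in Table 3.2.

```python
def is_concurrent(parent, task, version):
    """Return True if the block owning `parent` is concurrent with task_version.

    `parent` is the parent vector of the executing block, stored as a list.
    Tasks are numbered from 1, so the entry for task t is parent[t - 1].
    """
    return parent[task - 1] < version


# Parent vector of block 1_5 as listed in Table 3.2.
pv_1_5 = [4, 0, 1, 1]

concurrent_with_2_1 = is_concurrent(pv_1_5, task=2, version=1)  # entry 2 is 0 < 1
concurrent_with_3_1 = is_concurrent(pv_1_5, task=3, version=1)  # entry 3 is 1, not < 1
```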

New parent vectors are formed from the vectors associated with parent blocks as in
[Cho89,CMN88,MC88]. When one of the parents of a new block c has task identifier $t_v$,
the t-th entry of the parent vector of c is v. Otherwise, the t-th entry of the parent vector
of c is the maximum of the t-th entries of the parent vectors of the parents of c. Thus, when
blocks $p_1 \ldots p_m$ with task identifiers $(t_1)_{v_1} \ldots (t_m)_{v_m}$ create a new
block c, the parent vector of c is initialized using Algorithm 3.4.

procedure Build-Parent-Vector(c, p_1 ... p_m)
    for i := 1 to A do
        if there exists p_j ∈ {p_1 ... p_m} such that i = t_j then
            parent(c)[i] := v_j
        else parent(c)[i] := max(parent(p_1)[i], ..., parent(p_m)[i])
    end for
end procedure

Algorithm 3.4: Build Parent Vector

To illustrate the use of parent vectors, consider Table 3.2, which shows the parent
vectors for the POEG in Figure 3.2. (While parent vectors for the entire graph are
shown, only those for the blocks currently executing, $1_5$ and $2_4$, exist at one time.) Two
ancestors of block $2_4$ were assigned task 3, namely $3_1$ and $3_2$. Therefore, the third entry
of the parent vector of block $2_4$ contains 2, since this is the version number of the ancestor
of $2_4$ that was most recently assigned task 3. The parent vector of block $2_2$ (the child
of the coordination edge) contains $2_1$ and $1_4$ as well as all of the ancestors of $1_4$. The
parent vector for $2_4$ is calculated as the entry-wise maximum of the parent vectors of its
parent blocks ($2_3$, $3_2$, and $4_2$) and their task identifiers. Lastly, we can tell that block
$1_5$ is concurrent with block $2_1$ since the second entry in the parent vector of $1_5$,
[4,0,1,1], is less than 1. However, block $1_5$ is not concurrent with $3_1$ since the third
entry in the parent vector of $1_5$ is not less than 1.

Task Ids               Parent Vector
$1_1$                  [0,0,0,0]
$1_2$, $2_1$           [1,0,0,0]
$1_3$, $3_1$, $4_1$    [2,0,0,0]
$1_4$                  [3,0,1,1]
$2_2$                  [4,1,1,1]
$2_3$, $3_2$, $4_2$    [4,2,1,1]
$1_5$                  [4,0,1,1]
$2_4$                  [4,3,2,2]

Table 3.2: Parent Vector Table
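
The construction in Algorithm 3.4 can be checked against Table 3.2 with a short Python sketch. The function name and the list-of-pairs input format are assumptions made for illustration. The two cases below use the parent blocks that the text names explicitly: $2_2$ is built from $2_1$ and $1_4$, and $2_4$ from $2_3$, $3_2$, and $4_2$.

```python
def build_parent_vector(parents, num_tasks):
    """Build the parent vector of a new block from its direct parents.

    `parents` is a list of ((task, version), parent_vector) pairs, one per
    direct parent. Tasks are numbered from 1; vectors are plain lists.
    """
    child = [0] * num_tasks
    for i in range(1, num_tasks + 1):
        # Versions of direct parents that are themselves assigned task i.
        own = [version for (task, version), _ in parents if task == i]
        if own:
            child[i - 1] = max(own)
        else:
            # Otherwise take the entry-wise maximum over the parents' vectors.
            child[i - 1] = max(vector[i - 1] for _, vector in parents)
    return child


# Block 2_2: child of the coordination edge, with parents 2_1 and 1_4.
pv_2_2 = build_parent_vector(
    [((2, 1), [1, 0, 0, 0]), ((1, 4), [3, 0, 1, 1])], num_tasks=4)

# Block 2_4: created by the join of 2_3, 3_2, and 4_2, which share a vector.
shared = [4, 2, 1, 1]
pv_2_4 = build_parent_vector(
    [((2, 3), shared), ((3, 2), shared), ((4, 2), shared)], num_tasks=4)
```

Both results agree with the corresponding rows of Table 3.2.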

The primary benefits of using task identifiers and parent vectors for encoding the
concurrency relationship of the POEG are two-fold. First, there is a small constant cost for
determining whether the executing block is concurrent with another block. In particular,
the cost of telling if two blocks are concurrent is an array index and a comparison
operation. Second, only a small amount of information is needed about a previous block for
a currently executing block to determine if they are concurrent. Storing a task identifier
in an access history is sufficient to perform the concurrency check at a variable access.
Parent vectors are necessary only for the blocks that are currently executing; parent
vectors of terminated blocks are discarded. This is an improvement over English-Hebrew
labeling, in which a complex object (a string of integers) is associated with each entry in an
access history, and over trace-based methods, where ancestor and descendant information
for all blocks is maintained.

The primary drawback of task recycling is the cost associated with maintaining parent
vectors. The length of each parent vector is equal to the number of tasks A. (Section
3.4 shows that A is in the worst case bounded by T', the maximum concurrency of the
POEG if coordination edges were removed.) Because of the coordination constraint, the
total amount of work required for parent vector maintenance is bounded by O(A) per
block. The number of parent vectors associated with executing blocks is no more than
the maximum concurrency, so that the total space for these parent vectors is O(T × A).
In addition, parent vectors are associated with outstanding asynchronous coordination
operations.

One way to decrease the overhead associated with parent vector maintenance is by
performing a good task assignment, that is, a task assignment that minimizes the number
of tasks used. Issues in performing good task assignments are discussed in Sections 3.3
and 3.4. A second way to decrease the costs is to take advantage of the very simple
concurrency structure of many parallel programs.

Data structure transformation techniques can be used to limit the work and space
requirements of maintaining parent vectors for programs that have simple concurrency
structures. For example, parallel Fortran programs seldom have nested parallelism or
coordination, so that many concurrent blocks have the same or similar ancestor sets. By
performing data structure transformations, the worst-case overheads associated with
parent vector maintenance are incurred only in programs with complex coordination patterns.
One strength of the task recycling method is that its performance degrades gracefully when
these transformations are used; threads that coordinate pay an extra cost, but threads
that do not coordinate do not perform any additional work.

One obvious optimization is that two blocks with the same set of ancestors can share
a parent vector. (If local memory is available and the cost of accessing shared memory
is much higher than that of local memory, sharing of data structures may not be
cost-effective.) For example, all blocks created by a doall operation have the same ancestors;
a block must obtain a private parent vector only if it terminates before the endall (by
performing a nested parallel construct or coordination operation). For a parallel construct
with parallelism of W, the number of parent vectors is reduced by a factor of W. Thus,
for a program with no nested parallelism or coordination, only a single parent vector is
maintained.

In addition, by a proper choice of data structure, the parent vector computation
overhead for innermost parallel constructs with no coordination can be reduced to an
average cost of O(1) per block. Let parent be the shared parent vector of the blocks
executing after an innermost fork operation. At the corresponding join operation, rather
than creating a new parent vector for the block j executing after the join operation, parent
is updated and reused for j. When a block with task identifier $t_v$ terminates in the join
operation, it must update the t-th entry of parent to be v. However, this update cannot
be performed until all of the blocks that share parent have reached the join operation.
Instead, the identifiers are buffered until all children complete and are reflected in parent
by block j immediately before j begins execution.
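
The buffering scheme described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the class name `SharedParentVector` and its methods are hypothetical, the vector is a plain list with tasks numbered from 1, and synchronization between threads (which a real monitor would need around the buffer and counter) is deliberately omitted.

```python
class SharedParentVector:
    """One parent vector shared by all blocks created by an innermost fork."""

    def __init__(self, vector, num_children):
        self.vector = vector            # read by every child while it runs
        self.pending = []               # identifiers buffered at the join
        self.remaining = num_children   # children that have not yet reached the join

    def child_terminates(self, task, version):
        # Constant work per block: the identifier is only buffered, because
        # siblings that are still executing may be reading the shared vector.
        self.pending.append((task, version))
        self.remaining -= 1
        return self.remaining == 0      # True for the last child to arrive

    def start_join_block(self):
        # Run by the block after the join, immediately before it executes.
        if self.remaining != 0:
            raise RuntimeError("join block started before all children finished")
        for task, version in self.pending:
            self.vector[task - 1] = version
        self.pending.clear()
        return self.vector              # the same list is reused, not copied


# Hypothetical fork with three children 1_2, 2_1, 3_1 sharing the vector [1, 0, 0].
original = [1, 0, 0]
shared = SharedParentVector(original, num_children=3)
for ident in [(2, 1), (1, 2), (3, 1)]:
    shared.child_terminates(*ident)
join_vector = shared.start_join_block()
```

With W children the join performs W entry updates in total, which is the average cost of O(1) per block stated in the text.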

The average cost of maintaining parent vectors can also be reduced in the presence
of coordination. For instance, consider an innermost parallel construct C such that no
thread created in C coordinates with a thread outside of C. The amount of work required
for the fork and join operations is still O(1). Modification-sets can be used to bound the
cost of updating a parent vector at a coordination point to be a function of the number
of indirect parents of the coordinating block.

A modification-set is associated with every thread and stores the indirect parents of
the currently executing block. The modification-set of the first block b in a thread is
initialized with the task identifier $t_v$ assigned to b. The first time a thread coordinates, a
private parent vector is created and initialized. The modification-set and parent vector for
block b are updated after a coordination point with indirect parents $p_1 \ldots p_m$ by executing
the code shown in Algorithm 3.5. In addition, if two task identifiers in the
modification-set have the same task, then the one with the smaller version number is deleted.

procedure Build-Parent-Vector(b, p_1 ... p_m)
    if first coordination then
        parent(b) := copy(parent(b))
    endif
    for all p ∈ p_1 ... p_m do
        for all t_v ∈ modification-set(p)
            parent(b)[t] := v
        end for
        modification-set(b) := modification-set(b) ∪ modification-set(p)
    end for
    modification-set(b) := modification-set(b) + t_v
end procedure

Algorithm 3.5: Build Parent Vector Using Modification Set

The cost incurred at a coordination point increases with the number of indirect
ancestors, which is a function of the pattern of coordination. 0(A'^) modification-sets can be
used to support more general coordination patterns.

Lastly, one can consider optimizations that work only for "important" subclasses of
parallel programs. For instance, the class of parallel programs supported by the
on-the-fly technique of Hood, Kennedy and Mellor-Crummey [HKMC89] contains programs with
parallel constructs and doacross coordination with distance 1. This anomaly detection
technique requires O(1) work for managing ancestor information at parallel operations
and O(log(N)) work to determine if two blocks are concurrent.

The costs incurred by task recycling can be similarly decreased for this class of parallel
programs. Specifically, the cost of maintaining parent vectors at doacross coordination
points can be eliminated by associating a shared iterate vector with the shared parent
vector. The t-th entry of iterate contains the iterate of the loop that is assigned to task t, if
any. Otherwise, the t-th entry contains W + 1, where W is the number of iterates created.
A block b is in a higher iterate than a block with task identifier $t_v$ if and only if the
iterate number of b is higher than iterate[t]. The iterate vector can be used to determine
if two blocks are concurrent after a sequence of doacross coordinations, without storing
any additional parent information.

After every doacross operation, each block increments its version number. The original
parent vector check is used to tell if a block b is concurrent with a block with task identifier
$t_v$ whenever $t_v$ is in a higher iterate than b. If $t_v$ is in a lower iterate than b, block b first
normalizes the version number stored in the parent vector by the number of doacross
operations that b has executed, and then performs the standard parent vector check.
Algorithm 3.6 shows the algorithm used to perform the check, where k is the number of
doacross operations that have been executed. This optimization only slightly increases the
cost of telling if two blocks are concurrent while eliminating all parent vector maintenance
at coordination points.

procedure IsConcurrent(b, t_v)
    if my-iterate < iterate[t] then
        if parent[t] < v then
            return true
        else return false
    else
        if parent[t] + k < v then
            return true
        else return false
    endif
end procedure

Algorithm 3.6: Check Concurrent Using Iterate Vectors
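
Algorithm 3.6 transcribes directly into Python, as sketched below. The function signature is hypothetical, and the example values are assumptions chosen for illustration: a doacross loop with four iterates in which task t runs iterate t, a shared parent vector that is all zeros for these tasks, version numbers that start at 1, and an executing block in iterate 3 that has completed k = 2 doacross operations.

```python
def is_concurrent(my_iterate, k, parent, iterate, task, version):
    """Concurrency check of Algorithm 3.6 against the identifier task_version.

    my_iterate: loop iterate of the executing block b
    k:          number of doacross operations b has executed
    parent:     shared parent vector (list, tasks numbered from 1)
    iterate:    shared iterate vector (list, tasks numbered from 1)
    """
    if my_iterate < iterate[task - 1]:
        # The other block is in a higher iterate: standard parent vector check.
        return parent[task - 1] < version
    # Otherwise normalize the stored version by k before the standard check.
    return parent[task - 1] + k < version


# Assumed example: four iterates, task t runs iterate t, shared vector all zeros.
parent = [0, 0, 0, 0]
iterate = [1, 2, 3, 4]

vs_2_2 = is_concurrent(3, 2, parent, iterate, task=2, version=2)  # 0 + 2 < 2 fails
vs_2_3 = is_concurrent(3, 2, parent, iterate, task=2, version=3)  # 0 + 2 < 3 holds
vs_4_1 = is_concurrent(3, 2, parent, iterate, task=4, version=1)  # higher iterate
```

Under these assumptions the block in iterate 3 is ordered after version 2 of the block in iterate 2, but is concurrent with its version 3 and with every block in iterate 4.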

Other optimizations are discussed in Chapter 4.

3.3 Task Assignment Problem

The use of parent vectors to store the ancestor relationship assumes the existence of an
algorithm for assigning tasks to blocks. A block b can be validly assigned to a task t if
and only if b is not concurrent with any other block previously assigned to t. Since the
number of tasks used in a task assignment determines the length of parent vectors, the
goal of a task assignment algorithm is to use as few tasks as possible while maintaining
a valid task assignment. Theorem 7 gives the lower bound on the number of tasks in a
valid task assignment.

Theorem 7: The minimum number of tasks needed to perform a valid
assignment to a POEG is equal to the maximum concurrency of the POEG.

Any valid task assignment that uses the minimal number of tasks is said to be optimal.^3
This and the following section discuss the task assignment problem and present several
task assignment algorithms.

^3 This property was first proved by Dilworth and defined as the number of chains required
to cover every element in a general partial order [Dil50,Per63].

There are several different systems that use data structures similar to parent vectors
for maintaining concurrency relationships [Fid88,MC88]. All of these systems assume a
constant degree of parallelism throughout the program execution, thereby avoiding the
issue of task assignment. (Since every block has exactly one direct parent and one direct
child, each block is simply assigned the same task as its direct parent block with an
incremented version number.) The following algorithms can also be used to expand the
class of parallel programs supported by these systems.

Let us first consider off-line task assignment, in which the entire POEG is available
when the assignment is performed. Theorem 8 gives an upper bound on the work required
to perform an optimal assignment.

Theorem 8: The complexity of performing an off-line optimal task
assignment is equivalent to the complexity of maximum bipartite matching.

Corollary 2: There exists an $O(B^{2.5})$ off-line optimal task assignment
algorithm.

The proof of Theorem 8 is constructive, but requires the entire POEG to be available.
The algorithm is of little use, therefore, for on-the-fly anomaly detection. However, it
can be used to perform task assignment for trace-based analysis. In fact, this is the
best general task assignment algorithm (since there is also a reduction from maximum
bipartite matching). Any improvement in maximum bipartite matching will lead to a
similar improvement in the off-line task assignment algorithm.
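
The connection to bipartite matching can be illustrated with the standard reduction for covering a partial order with a minimum number of chains. The sketch below is not the construction from the proof of Theorem 8, which is not reproduced in this section. It assumes that blocks are given with their complete (transitive) ancestor sets, and it uses a simple augmenting-path matching rather than a faster matching algorithm. All names are hypothetical.

```python
def optimal_task_assignment(blocks, ancestors):
    """Assign tasks off-line so that each task is a chain of ordered blocks.

    blocks:    list of block names
    ancestors: dict mapping each block to the set of all of its ancestors
    Returns a dict mapping each block to a task number starting at 1.
    """
    descendants = {a: [b for b in blocks if a in ancestors[b]] for a in blocks}
    pred_of = {}  # block -> block that precedes it on the same task
    succ_of = {}  # block -> block that follows it on the same task

    def try_extend(a, seen):
        # Augmenting-path search: find a successor for a, re-routing if needed.
        for b in descendants[a]:
            if b in seen:
                continue
            seen.add(b)
            if b not in pred_of or try_extend(pred_of[b], seen):
                pred_of[b] = a
                succ_of[a] = b
                return True
        return False

    for a in blocks:
        try_extend(a, set())

    # Every block without a predecessor starts a new chain, i.e. a new task.
    task_of = {}
    next_task = 1
    for b in blocks:
        if b in pred_of:
            continue
        cur = b
        while cur is not None:
            task_of[cur] = next_task
            cur = succ_of.get(cur)
        next_task += 1
    return task_of


# Hypothetical fork-join: r forks x and y, which join into j.
example_ancestors = {"r": set(), "x": {"r"}, "y": {"r"}, "j": {"r", "x", "y"}}
tasks = optimal_task_assignment(["r", "x", "y", "j"], example_ancestors)
```

For this small graph the maximum concurrency is two (x and y), and the sketch uses exactly two tasks, in line with Theorem 7.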

On-the-fly anomaly detection requires an on-line task assignment algorithm that
makes decisions based on the current state of the POEG. Unfortunately, there is no
optimal on-line task assignment algorithm. Theorem 9 gives a lower bound on the number
of tasks required in the worst case.

Theorem 9: A lower bound on the largest number of tasks needed for any
on-line task assignment algorithm on POEGs is ^ - log(T).

It is important to note that whether or not a task assignment is optimal does not
affect the correctness of the concurrency relationship represented by the parent vectors;
suboptimality only increases the overheads associated with parent vector maintenance.

To illustrate the difficulty of performing an optimal on-line task assignment, consider
the POEG in Figure 3.3. At this stage in the execution, block $b_1$ has been created but
is not yet assigned. After blocks $1_5$ and 2^ coordinate synchronously, free tasks 4 and 6
are available for assignment to children of both $1_5$ and 2^. In particular, block $b_1$ can be
assigned either task 4 or 6. However, depending on the future execution of the program,
either choice may be suboptimal.

Figure 3.3: Partial Task Assignment

Suppose $b_1$ is assigned task 6. At some time in the future, block $5_2$ could create two
new threads, $b_2$ and $b_3$. This POEG is shown in Figure 3.4. Both $b_2$ and $b_3$ are concurrent
with block $4_1$ and therefore cannot use task 4. A new task must be generated to assign
one of $b_2$ or $b_3$; the other can be assigned task identifier $5_3$. This forces seven tasks to
be used in the assignment even though the graph has a maximum concurrency of six.
(A symmetric scenario holds if $b_1$ is assigned task 4.)

The following definitions are used in the proofs of Theorem 7 and supporting Lemma 2
and Lemma 3. Let G = (V, E) be a POEG, V_1 ⊂ V and V_2 = V - V_1, and let O be a set
of block edges from vertices in V_1 to vertices in V_2. O is an oriented cut if and only if all
edges in E that have one end point in V_1 and one in V_2 are directed from V_1 to V_2. A
path cover is a set of paths from the root to the leaf node which together traverse every
block edge in the graph at least once.

Lemma  2  A  set  of  block  edges  are  concurrent  if  and  only  if  they  are  in  an  oriented  cut. 

Proof.

⇐ Suppose, for the sake of contradiction, that there are two edges (v_i, a_0) and (a_k, v_j)
in an oriented cut and (v_i, a_0) is an ancestor of (a_k, v_j). From the definition of an oriented
cut we know that v_i and a_k ∈ V_1 and a_0 and v_j ∈ V_2. Consider all of the vertices a_1 ... a_k
on a path from a_0 to v_j. If a_1 ∈ V_1, the edge (a_0, a_1) would go from V_2 to V_1; therefore,
a_1 must be in V_2. By continuing this argument we can deduce that a_{k-1} must be in
V_2. However, from our original assumption we know that a_k is in V_1. Therefore, if two
non-concurrent blocks are in the cut there must be an edge from V_2 to V_1, namely
(a_{k-1}, a_k), which contradicts the definition of an oriented cut.

⇒ Suppose, for the sake of contradiction, that there is a set of concurrent blocks B that
is not a subset of any oriented cut. There must be a maximal subset B' of B such that all
blocks in B' are in at least one oriented cut O, but B' is never in a cut with some block
b ∈ B - B'. Since O is a cut, the set of blocks A = O - B' must contain only ancestors or
only descendants of b. We know there are no paths between A and any other block in O
(since all blocks in a cut are concurrent) or between any block in B' and b. If A contains
ancestors of b, we can move A and all blocks on paths from A to b to V_1 and make b part
of O; if A contains descendants of b, we move A and all blocks on paths from b to A to V_2
and make b part of O. Because we can add b to O, a contradiction has been reached. □

Figure 3.4: Partial Task Assignment

Lemma 3 The number of paths needed to cover a POEG is equal to its maximum
concurrency.

Proof. (By induction on the maximum concurrency T of a POEG G):

Base Case (T = 1): All blocks in G have a single child, so G can be covered with one
path.

Induction Hypothesis: Suppose that the lemma holds for all POEGs with T < i; prove
that all POEGs with T = i can be covered with i paths. Create the leftmost path L from
the root to the leaf block and let M be the set of blocks in L that appear in cuts of size i.
We define a null vertex to be any vertex with one parent block and one child block. We
will modify G to obtain a new POEG G' with maximum concurrency i - 1 by deleting all
blocks in M.

In the first step we modify every vertex that is the head or tail of a block b = (v_p, v_c)
∈ M. There can never be more than one path through b; this follows directly from
Lemma 2 and the fact that a path cannot traverse concurrent blocks. Therefore, if v_p is a
receiver vertex, its associated coordination edge will never be traversed and can be deleted,
thereby changing v_p into a null vertex. We know that v_p cannot be a join vertex: otherwise,
the parent blocks of b would be part of a subgraph with maximum concurrency less than
i. By symmetric arguments, if v_c is a sender vertex then we delete the coordination edges
and change it to a null vertex, and we know that it cannot be a fork vertex. After making
this transformation, every block edge in M has a parent vertex that is either a sender,
null, or fork vertex and a child vertex that is either a receiver, null, or join vertex.

The following properties hold for every maximal sequence S of blocks in M:

• S begins with either a fork or receiver vertex: if S begins with a null vertex v, the
parent block of v would be in a directed cut of size at least i and be part of S.

• Likewise, S ends with either a join or sender vertex.

• Every internal vertex v ∈ S is a null vertex: v cannot be either a sender or fork
vertex because it is the child vertex of some block in M, and v cannot be either a
receiver or join vertex because it is the parent vertex of some block in M.

Each sequence S in M is deleted from G, creating a new POEG G' with maximum
concurrency i - 1. Because each S starts with a vertex with multiple child blocks, ends with
a vertex with multiple parent blocks, and has no internal coordination edges, G' remains
well formed. By our induction hypothesis, G' can be covered with i - 1 paths; together
with the path L these cover G with i paths, and therefore the lemma holds for all i. □

Theorem  7  The  minimum  number  of  tasks  needed  to  perform  a  valid  assignment  to  a 
POEG  is  equal  to  the  maximum  concurrency  of  the  POEG. 

Proof. Let A be the number of tasks needed to perform the smallest valid assignment of
a POEG and T the maximum concurrency of the POEG. From Lemma 3 we know that
the number of paths C needed to cover a POEG is equal to T. All blocks on a path can
be assigned to the same task; by definition they are related by an ancestor/descendant
relationship and by Theorem 1 are not concurrent. Thus, the number of tasks needed for
a valid assignment of the POEG is less than or equal to the size of the path cover, so that
A ≤ C = T. However, each concurrent block needs a unique task so that A ≥ T. Since
A ≤ T and A ≥ T, it must be the case that the size of the smallest valid task assignment
is equal to the maximum concurrency of the POEG. □
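
As a rough illustration of Theorem 7, the following Python sketch computes the maximum
concurrency of a tiny block graph by brute force and checks that an assignment using
exactly that many tasks is valid. The graph, the simplification of treating blocks as
nodes rather than edges, and the names PARENTS, concurrent, max_concurrency and
is_valid are assumptions made for this sketch; they do not appear in the thesis.

```python
from itertools import combinations

# Hypothetical block graph: a forks into b and c, which join into d.
# PARENTS maps each block to the blocks that immediately precede it.
PARENTS = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

def ancestors(block):
    """Return every block that precedes `block`, directly or transitively."""
    seen, stack = set(), list(PARENTS[block])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(PARENTS[p])
    return seen

def concurrent(x, y):
    """Two distinct blocks are concurrent when neither precedes the other."""
    return x != y and x not in ancestors(y) and y not in ancestors(x)

def max_concurrency():
    """Size of the largest set of pairwise concurrent blocks (brute force)."""
    blocks = list(PARENTS)
    for size in range(len(blocks), 0, -1):
        for group in combinations(blocks, size):
            if all(concurrent(x, y) for x, y in combinations(group, 2)):
                return size
    return 0

def is_valid(assignment):
    """An assignment is valid if no task is shared by two concurrent blocks."""
    return all(not concurrent(x, y)
               for x, y in combinations(assignment, 2)
               if assignment[x] == assignment[y])

# Two paths cover the graph: a, b, d share task 1 and c uses task 2.
assignment = {"a": 1, "b": 1, "d": 1, "c": 2}
```

Here the path cover {a, b, d} and {c} uses two tasks, which matches the maximum
concurrency of the example graph.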

Theorem 8 The complexity of performing an off-line optimal task assignment is
equivalent to the complexity of maximum bipartite matching.

Proof. The proof of Theorem 8 consists of a linear time reduction of optimal task
assignment to maximum bipartite matching and a linear time reduction of maximum
bipartite matching to optimal task assignment.
⇐ We construct a bipartite graph G_b = ({P, C}, E) for a given POEG G such that:

|P| = |C| = the number of blocks in the POEG, and

(p_i, c_j) ∈ E if and only if p_i ∈ P, c_j ∈ C and b_i is an ancestor of b_j in G.

There is an ancestor set associated with each vertex v in the POEG that is computed by
performing a topological traversal of the POEG. If a block b ends in a fork or join vertex v,
O(B) work is required to update the information associated with v. If b ends in a sender
vertex, it performs a union of its ancestor information with the ancestor information of
the other senders of that coordination point. This composite ancestor set is used by each
of the receiver blocks of the coordination point. For programs that meet the coordination
constraint*, the total amount of work required to compute the transitive closure is O(B^2).
A matching M consists of |M| edges where:

- M_C denotes the end points in M from vertex set C,

- M_P denotes the end points in M from vertex set P,

- The two endpoints of an edge in M are partners,

- The two vertices that correspond to the same block (e.g. p_i and c_i) are mates.

Two properties follow from these definitions:


*For POEGs that do not meet the coordination constraint, the amount of work required is bounded by
O(B^2 × T). If one wants to consider the problem of task assignment to general partial orders, the amount
of work required to compute the ancestor sets (e.g. the transitive closure of the partial order) is equivalent
to performing matrix multiplication.

1. Every matching M of G_b corresponds to a valid task assignment which uses |C| -
|M| tasks. Each vertex ∉ M_C is assigned to a unique task; each vertex ∈ M_C is
assigned to the task associated with the mate of its partner vertex with one higher
version number. This assignment algorithm is valid because each task is assigned to
a set of non-concurrent blocks, and every block has a task assigned to it. Moreover,
the number of tasks required for the assignment is equal to |C| - |M|.

Figure 3.5 shows an example matching. This would result in b_1 being assigned 1_1
and b_2 assigned 2_1 since they are not in M_C. Block b_3 would be assigned 2_2 since the
mate (c_2) of b_3's partner (p_2) was assigned task 2_1. Similarly, b_4 would be assigned
1_2 and b_5 assigned 1_3.

Figure 3.5: Example Bipartite Matching

2. Every valid task assignment which uses T tasks has a corresponding matching M of
G_b of size |C| - |T|. For each task t add to M every edge in G_b which connects p_{t_j}
and c_{t_{j+1}} (t_j denotes the j-th block assigned to task t). This is a matching because
each block is assigned only one task and there is an edge between two nodes whenever
it is required. The only vertices in C which are not included are the first blocks
assigned to each task (e.g. t_0 for each task t) and hence |M| is |C| - |T|.

It follows directly from 1 and 2 that by creating a maximum matching we obtain an
optimal task assignment.
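
The reduction in properties 1 and 2 can be sketched in Python as follows. This is a
minimal illustration under stated assumptions: blocks are treated as nodes listed in
topological order, the matching is found with simple augmenting paths (Kuhn's
algorithm) rather than the Hopcroft-Karp algorithm cited in Corollary 2, version
numbers are ignored, and the names PARENTS, ancestor_sets, maximum_matching and
assign_tasks are hypothetical.

```python
# Hypothetical POEG, blocks listed in topological order: a forks into b and c,
# which join into d. PARENTS lists the immediate predecessors of each block.
PARENTS = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

def ancestor_sets(parents):
    """Ancestor set of every block, built in topological (insertion) order."""
    anc = {}
    for block, preds in parents.items():
        anc[block] = set(preds)
        for p in preds:
            anc[block] |= anc[p]
    return anc

def maximum_matching(anc):
    """Augmenting-path matching on edges (p_i, c_j) where b_i is an ancestor of b_j.

    Returns partner_of_c, mapping each matched c_j to its partner p_i.
    """
    partner_of_c = {}

    def try_augment(p, visited):
        for c in anc:                       # candidate vertices in C
            if p in anc[c] and c not in visited:
                visited.add(c)
                if c not in partner_of_c or try_augment(partner_of_c[c], visited):
                    partner_of_c[c] = p
                    return True
        return False

    for p in anc:
        try_augment(p, set())
    return partner_of_c

def assign_tasks(parents):
    """Blocks outside M_C open a new task; matched blocks reuse their partner's."""
    anc = ancestor_sets(parents)
    partner_of_c = maximum_matching(anc)
    task, next_task = {}, 1
    for block in parents:                   # partners are ancestors, so they come first
        if block in partner_of_c:
            task[block] = task[partner_of_c[block]]
        else:
            task[block] = next_task
            next_task += 1
    return task, len(partner_of_c)

tasks, matching_size = assign_tasks(PARENTS)
```

For this four-block example the matching has two edges, so the assignment uses
|C| - |M| = 4 - 2 = 2 tasks, with the concurrent blocks b and c on different tasks.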

⇒ We are given an instance of a maximum matching problem for bipartite graphs with
graph G_b = ({P, C}, E) such that |P| = |C| = n, and for all edges (p, c) ∈ E, p ∈ P and
c ∈ C. We can transform this into a problem of task assignment by creating a parallel
program with the corresponding POEG G as is shown in Figure 3.6. G initially creates
two sets of parallel threads: p_1 ... p_n and c_1 ... c_n. Each p_i creates two parallel blocks
p_{i1} and p_{i2} that are sequentially followed by a sequence of blocks that are separated by
sender vertices. Each c_j is followed by a sequence of blocks that are separated by receiver
vertices and then two parallel blocks c_{j1} and c_{j2}. There are asynchronous coordination
edges from thread p_i to thread c_j if and only if there is an edge from vertex p_i to vertex
c_j in G_b. The number of block edges in G is 5C + 5P + 2E + 6 and the number of
coordination edges is E so that this transformation requires O(C + P + E) space and
work.

    parallel case
        doall c := 1 to n
            for all p do
                if (p,c) ∈ E then
                    receive(p)
                endif
            end for
            doall j := 1 to 2
                null statement
            endall
        endall
    parallel
        doall p := 1 to n
            doall j := 1 to 2
                null statement
            endall
            for all c do
                if (p,c) ∈ E then
                    send(c)
                endif
            end for
        endall
    parallel end

Figure 3.6: Example POEG

The number of tasks required to assign G if the coordination edges are ignored is 4n.
Consider some task assignment A. Assume for all i that the receiver sequence of thread
c_i is assigned the same task as one of the two parallel blocks of c_i, and that the sender
sequence of thread p_i is assigned the same task as one of the two parallel blocks of p_i.
(Which of the two parallel blocks shares the task does not affect the optimality of an
assignment.) If |A| < 4n, then some task was assigned to both a block of thread p_i and a
block of thread c_j for some i and j.

Consider any maximum matching M for G_b. There is a corresponding optimal task
assignment A that uses 4n - |M| tasks. In A, a block of thread c_j is assigned to the same
task as a block of thread p_i if and only if there is an edge (p_i, c_j) in M. If there was a
better task assignment in G it would result in a larger matching in G_b. Conversely, an
optimal assignment A corresponds to a matching M of size 4n - |A| in which an edge
(p_i, c_j) is included in M if and only if a block of thread c_j is assigned to the same task
as a block of thread p_i in A. Therefore, the amount of work required to perform an
optimal task assignment is an upper bound on the work required to create a maximum
matching. □

Corollary 2 There exists an O(B^{2.5}) off-line optimal task assignment algorithm.

Proof. This follows directly from Theorem 8 and results by Hopcroft and Karp [HK73].


Lemma 4 There exists a POEG with maximum concurrency 2^{i+1} - 1 such that at least
2^{i+1} + 2^{i-1} - 2 tasks are required by any on-line valid task assignment.

Proof. (By adversary argument): Consider the POEG in Figure 3.7. It has three
branches: two have a fork-join set containing 2^i - 1 parallel blocks and one is a
single thread. The set X_i is a child of both L_i and R_i. The maximum concurrency of this
graph is therefore (2^i - 1) + (2^i - 1) + 1 = 2^{i+1} - 1.

Figure 3.7: POEG at time t

A task is reachable from a block b if and only if it was last assigned to an ancestor
of b. Let avail_b be the number of free tasks reachable by block b. We will order the
execution such that all of the blocks in X_i are assigned before either block l or r. The
tasks last assigned to block x and the blocks of L_i and R_i are reachable from the blocks
in X_i. After the task assignment of the blocks of X_i, the number of tasks reachable from
l and r is equal to the sum of the sizes of L_i and R_i less the tasks used from them during
the assignment; namely, avail_r + avail_l = 2 × (2^i - 1) - (2^i - 2) = 2^i. Therefore, either
avail_l or avail_r must be less than or equal to 2^{i-1}.

If avail_l ≤ 2^{i-1}, we create a subgraph of l that is similar to the original graph except
that each fork-join set contains 2^{i-1} - 1 parallel blocks (shown in Figure 3.8a); a
symmetric construction is used if avail_r ≤ 2^{i-1}. We first show that this modification does
not change the maximum concurrency of the POEG. There are four groups of blocks that
form maximally oriented cuts for each of the three branches of the graph (as is seen in
Figure 3.8b); each group has maximum concurrency of 2^i - 1. The three oriented cuts of
the entire graph that pass through more than one of these groups, {A, D, c}, {B, D, c},
and {B, C, d}, all have maximum concurrency 2^{i+1} - 1 and therefore the maximum
concurrency of the graph remains 2^{i+1} - 1.

Figure 3.8: POEG at time t + 1 if avail_l ≤ 2^{i-1} (panels (a) and (b))

The number of tasks needed for the task assignment of the subgraph is equal to
its maximum concurrency. Hence, the number of extra tasks required is the maximum
concurrency less the number of tasks reachable from l; namely, it is at least (2^i - 1) - 2^{i-1}.
Therefore, the total number of tasks required for a valid assignment to the entire graph
is at least (2^{i+1} - 1) + (2^{i-1} - 1). □

Theorem 9 A lower bound on the largest number of tasks needed for any on-line task
assignment algorithm on POEGs is 3T/2 - log(T).

Proof. We start with the graph in Figure 3.7 as described in the proof of Lemma 4.
The blocks in the subgraph have no reachable tasks, and hence Lemma 4 can be applied
recursively to this subgraph. If we start with an initial fork-join set size of 2^i - 1, we
can create i - 1 nested subgraphs; the fork-join sets of the last subgraph consist of a
single parallel block. This construction never increases the maximum concurrency T of
the graph, which is 2^{i+1} - 1, and the number of extra tasks required is at least:

    \sum_{j=1}^{i-1} (2^{j} - 1) = \sum_{j=1}^{i-1} 2^{j} - (i - 1) = 2^{i} - (i + 1) \ge \frac{T}{2} - \log(T)

Therefore, the total number of tasks needed to validly assign this POEG is at least

    \frac{3T}{2} - \log(T). □
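
The arithmetic in this proof can be checked numerically with a short Python sketch.
It assumes that log denotes the base-2 logarithm and that the i - 1 nested subgraphs
contribute 2^j - 1 extra tasks each for j = 1 ... i - 1, as read from Lemma 4; the
function name extra_tasks is hypothetical.

```python
import math

def extra_tasks(i):
    """Extra tasks forced by the i - 1 nested subgraphs of the construction."""
    return sum(2**j - 1 for j in range(1, i))

for i in range(1, 21):
    T = 2**(i + 1) - 1                        # maximum concurrency of the POEG
    extra = extra_tasks(i)
    assert extra == 2**i - (i + 1)            # closed form of the sum
    assert extra >= T / 2 - math.log2(T)      # bound on the extra tasks
    assert T + extra >= 3 * T / 2 - math.log2(T)
```

The loop only confirms the inequalities for small i; it is a sanity check, not a
substitute for the adversary argument.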

3.4      On-line  Task  Assignment  Algorithms 

This section presents a basic algorithm for performing task assignment and several other
variations. The algorithms attempt to limit the number of tasks in an assignment by
using a "most recently used" (MRU) heuristic. The intuition behind this heuristic is that
if a block a can validly be assigned to the free task that was last assigned to block b,
then a can also validly be assigned to any free task that was last assigned to an ancestor
of b. A free task dag stores information about the tasks that are available for reassignment
as well as the concurrency relationship of the blocks that were last assigned those tasks.

3.4.1      MRU  Task  Assignment  Algorithm 

Suppose a set of tasks T are currently not in use and block b needs to be assigned a task.
The MRU algorithm will assign b a task t_j ∈ T such that the following two conditions are
met:

1. Task t_j was last assigned to a block a that is a direct ancestor of b, and

2. No task in T was last assigned to a descendant of a.

The first condition guarantees that the assignment is valid (i.e. no task is assigned to
two concurrent blocks). The second condition enforces a "most recently used" heuristic
for limiting the number of tasks used in the assignment.
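
The two conditions can be sketched directly in Python. The block graph, the table of
free tasks, and the names PARENTS, LAST_BLOCK, valid_tasks and mru_tasks are
assumptions chosen for illustration; the example has no coordination edges, so every
ancestor is treated as a direct ancestor.

```python
# Hypothetical execution: r forks into p, q and s; p and q join into b1; b2 follows s.
PARENTS = {"r": [], "p": ["r"], "q": ["r"], "s": ["r"],
           "b1": ["p", "q"], "b2": ["s"]}

# Free tasks and the block each one was last assigned to (assumed state).
LAST_BLOCK = {1: "p", 2: "s", 3: "q", 4: "r"}

def ancestors(block):
    """Return every block that precedes `block`, directly or transitively."""
    seen, stack = set(), list(PARENTS[block])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(PARENTS[p])
    return seen

def valid_tasks(block):
    """Condition 1: the task was last held by an ancestor of `block`."""
    anc = ancestors(block)
    return {t for t, a in LAST_BLOCK.items() if a in anc}

def mru_tasks(block):
    """Condition 2: reject a task if a free task was last held by a descendant."""
    chosen = set()
    for t in valid_tasks(block):
        a = LAST_BLOCK[t]
        if not any(a in ancestors(other) for other in LAST_BLOCK.values()):
            chosen.add(t)
    return chosen
```

In this example block b1 may validly take task 1, 3 or 4, but the heuristic keeps only
tasks 1 and 3, because task 4 was last held by r, an ancestor of the blocks that last
held the other free tasks.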

To illustrate the MRU algorithm consider the POEG in Figure 3.9a. Blocks b_1 ... b_5
are unassigned and tasks 1, 2, 3, and 4 are available for reassignment. Suppose b_1 is
the next block to be assigned. Block b_1 may be validly assigned either task 1, 3, or 4;
block b_1 cannot be assigned to task 2 since b_1 is concurrent with block 2_2. Based on the
MRU heuristic, block b_1 will be assigned either task 1 or 3. Figure 3.9b shows an optimal
assignment made by the MRU task assignment algorithm in which block b_1 is assigned
task 1. A symmetric assignment results if block b_1 is assigned task 3. However, if block
b_1 is assigned task 4 the overall assignment cannot be optimal: a new task will have to
be created to assign to either b_2 or b_3.

Figure 3.9: Optimal Task Assignment to a POEG

The MRU algorithm can be implemented at run-time by maintaining a dag of free
tasks that are available for reassignment. The internal nodes of the free task dag are
associated with fork and join operations, and the leaf nodes are associated with currently
executing blocks. A set of free tasks is associated with each node. The outnode set of
a node n in a free task dag is the set of nodes to which n points. The entry point of
an executing block b is the node in the free task dag that is associated with b. Since we
assign tasks in the reverse order of the execution flow, the edges in the free task dag point
towards the root of the dag.

When a block b is created after operation O, an entry point n_b is created for b, an
edge is added from n_b to the node n_O associated with operation O, and a task is assigned
to b. To assign a task to a block b, the free task dag is traversed starting at n_b until a
node with a non-empty free task set is found. When a block b terminates at a fork or join
operation O, it updates the node n_O associated with O by adding its free task set and
outnodes to node n_O and deleting entry point n_b. It also adds its task to the free task set
of node n_O.


Figure 3.10: Construction of Free Task Dag

Figure 3.10 shows the structure of the free task dag after various operations. (An entry
point node is represented by a box and an operation node is represented by a circle.) The
state of the free task dag after blocks b_1, b_2, and b_3 are created by fork operation f is
shown in Figure 3.10a. Figures 3.10b and 3.10c respectively show the free task dag just
before b_1, b_2 and b_3 terminate, and immediately after they terminate in join operation j.

Unfortunately, when the free task dag is maintained in this manner, it will grow in
proportion to the length of the execution. In addition, task assignment is computationally
expensive if long paths of nodes with empty free task sets are traversed before an
available free task is found. The following path compression operations, subsume and
collapse, keep the free task dag compact and task assignment efficient by deleting
unneeded nodes. Lemma 6 proves that by performing these two optimizations whenever
possible, the structure of the free task dag is restricted to being a tree. This means that
every node has a single outnode. In addition, every internal node has a non-empty free
task set. These two conditions simplify the code for the management of the free task
dag. The algorithms for maintaining the free task dag when a block terminates at fork
or join operations and for assigning tasks to newly created blocks are shown in Algorithm
3.7. When a block performs a coordination operation, it simply increments its version
number.

procedure Fork(b, O, c_1 ... c_W)
    n_O := n_b
    n_O.free-tasks := n_O.free-tasks ∪ task
    for all children blocks c_1 ... c_W do
        make and initialize node n_{c_i}
        n_{c_i}.outnode := n_O
        n_O.innode := n_O.innode + n_{c_i}
    endfor
end procedure

procedure Join(b, O, c_j)
    c := n_b.outnode
    if n_O = null node then
        make and initialize node n_O
        n_O.outnode := c
        c.innode := c.innode + n_O
        n_{c_j} := n_O
    endif
    c.innode := c.innode - n_b
    n_O.free-tasks := n_O.free-tasks ∪ task
    remove node n_b
end procedure

procedure Assign-Task(n_b, task)
    if n_b.free-tasks ≠ ∅ then
        pick task from n_b.free-tasks
    else if n_b.outnode.free-tasks ≠ ∅ then
        pick task from n_b.outnode.free-tasks
    else task := new task
end procedure

Algorithm  3.7:  Task  Assignment  and  Free  Task  Management 
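
The bookkeeping of Algorithm 3.7 can be sketched in Python for a single, non-nested
fork-join. This is a simplified illustration, not the thesis implementation: the class
names Node and TaskPool, the function signatures, and the decision to build the join
node in one call are assumptions, and the collapse and subsume optimizations are left
out.

```python
class Node:
    """Free task dag node with a free task set and a single outnode (tree form)."""
    def __init__(self, outnode=None):
        self.free_tasks = set()
        self.outnode = outnode
        self.innodes = set()
        if outnode is not None:
            outnode.innodes.add(self)

class TaskPool:
    """Hands out fresh task identifiers and counts how many were created."""
    def __init__(self):
        self.count = 0
    def new_task(self):
        self.count += 1
        return self.count

def assign_task(node, pool):
    """Use a free task at the entry point, else at its outnode, else a new task."""
    if node.free_tasks:
        return node.free_tasks.pop()
    if node.outnode is not None and node.outnode.free_tasks:
        return node.outnode.free_tasks.pop()
    return pool.new_task()

def fork(entry, task, width):
    """The forking block's entry point becomes the fork node and frees its task."""
    entry.free_tasks.add(task)
    return [Node(outnode=entry) for _ in range(width)]

def join(entries, tasks):
    """Joining blocks release their tasks into one join node and are removed."""
    fork_node = entries[0].outnode
    join_node = Node(outnode=fork_node)
    for entry, task in zip(entries, tasks):
        join_node.free_tasks |= entry.free_tasks | {task}
        fork_node.innodes.discard(entry)
    return join_node

pool = TaskPool()
root = Node()
t0 = assign_task(root, pool)                  # the first block gets a new task
children = fork(root, t0, width=3)
child_tasks = [assign_task(c, pool) for c in children]
after_join = join(children, child_tasks)
t_next = assign_task(after_join, pool)        # reuses one of the freed tasks
```

In this run the three children use three tasks in total (the parent's task is reused
by one child), and the block created after the join picks up a freed task instead of
creating a fourth.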

Whenever a free task set of an internal node n becomes empty, the dag is collapsed by
deleting n from the dag. (The only exception is the root node, which is never collapsed.)
All nodes that point to n are modified to instead point to the outnode of n. This
guarantees that every internal node in the free task dag has at least one free task associated
with it. A collapse operation can be performed after a task is assigned. Whenever an internal
node n is the outnode of a single node p, n is subsumed by p by adding the free task set of
n to the free task set of p, deleting n and adjusting the edges as appropriate. A subsume
operation can be performed whenever a node is deleted from the free task dag.

Figure 3.11: Path Compression Optimizations

Figure 3.11 shows the changes to the free task dag after node n_c is collapsed and then
node n_b is subsumed. The algorithms for subsuming and collapsing the free task dag are
shown in Algorithm 3.8.
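
The two path compression operations can be sketched in Python on a small chain of
nodes. The Node class, the node names, and the boolean return values are assumptions
made for this illustration; the node layout is hypothetical and is not the one drawn
in Figure 3.11.

```python
class Node:
    """Free task dag node with a name, a free task set and a single outnode."""
    def __init__(self, name, outnode=None, free_tasks=()):
        self.name = name
        self.free_tasks = set(free_tasks)
        self.outnode = outnode
        self.innodes = set()
        if outnode is not None:
            outnode.innodes.add(self)

def check_collapse(n):
    """Delete a non-root node whose free task set is empty."""
    if n.free_tasks or n.outnode is None:
        return False
    c = n.outnode
    c.innodes.discard(n)
    for p in n.innodes:           # redirect every node that pointed at n
        p.outnode = c
        c.innodes.add(p)
    n.innodes = set()
    return True

def check_subsume(n):
    """Merge a node that has exactly one innode p into p."""
    if len(n.innodes) != 1:
        return False
    (p,) = n.innodes
    c = n.outnode
    p.free_tasks |= n.free_tasks
    p.outnode = c
    if c is not None:
        c.innodes.discard(n)
        c.innodes.add(p)
    n.innodes = set()
    return True

# Hypothetical chain: entry -> n_c -> n_b -> root, where n_c has no free tasks.
root = Node("root", free_tasks={4})
n_b = Node("n_b", outnode=root, free_tasks={2})
n_c = Node("n_c", outnode=n_b)
entry = Node("entry", outnode=n_c)

collapsed = check_collapse(n_c)   # entry now points at n_b
subsumed = check_subsume(n_b)     # entry absorbs the free tasks of n_b
```

After both operations the entry point holds task 2 and points directly at the root, so
a later task assignment inspects at most two nodes.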

To illustrate the MRU algorithm, consider Figure 3.12, which shows the state
of the free task dag at specific stages in the execution of the example program in Figure
3.9. Before the first fork operation there are three free tasks; namely, 2, 3 and 4. The
corresponding state in the free task dag is shown in Figure 3.12a. Figure 3.12b shows
the free task dag when the program is in the state in Figure 3.9a. At this point in the
execution, block b_2 can be assigned tasks 2 or 4. (Block b_1 can be validly assigned to task
1, 3 or 4.) Because task 2 was more recently used than task 4, block b_2 is assigned task
identifier 2_3. Similarly, block b_1 is assigned task identifier 1_6. (Block b_1 could equivalently
be assigned task identifier 3_3.) Figure 3.12c shows the free task dag when the program is
in the state in Figure 3.9b. The free task dag has been collapsed and subsumed leaving a
single node.

3.4.2     Complexity  of  the  MRU  Algorithm 

The  theorems  in  this  section  prove  the  complexity  of  the  MRU  algorithm. 

Theorem 10: The number of tasks in a task assignment performed by the
MRU algorithm is T'.




procedure Check-Subsume(n)
    if |n.innode| = 1 then
        c := n.outnode
        p := n.innode
        p.free-tasks := p.free-tasks ∪ n.free-tasks
        p.outnode := c
        c.innode := c.innode + p - n
        remove node n
    endif
end procedure

procedure Check-Collapse(n)
    if n.free-tasks = ∅ and n ≠ root then
        c := n.outnode
        c.innode := c.innode - n
        for all p ∈ n.innode do
            p.outnode := c
            c.innode := c.innode + p
        end for
        remove node n
    end if
end procedure

Algorithm  3.8:  Free  Task  Path  Compression 

Theorem 11: The size of the free task dag used by the MRU algorithm is
O(T + T').

(Recall that T denotes the maximum concurrency of the POEG and T' denotes the
maximum concurrency of the POEG with coordination edges deleted.) Theorem 10 proves
that T' tasks are used by the MRU algorithm to perform a valid assignment. Therefore,
the MRU algorithm is guaranteed to be optimal for programs that do not include any
coordination or unconstrained fork-join operations. Theorem 11 proves that the size of
the free task dag is bounded by O(T + T'). As with parent vector management, this cost
is a worst case measurement and the average case cost is much lower.

Theorem 12 proves that the average cost per block is proportional to the maximum level
of nesting of fork-join constructs, N.

Theorem 12: The average cost per block of performing the MRU algorithm
is O(N).

A direct consequence of Theorem 12 is that O(1) work is performed for task assignment
for non-nested parallel constructs. Regardless of the level of nesting, if the number of
threads created in a parallel construct is always greater than twice the level of nesting
(i.e. W > 2N), the number of nodes in the free task dag is O(^) and the average work
for a task assignment is O(1).

Figure 3.12: Stages in Free Task Dag for POEG in Figure 3.9

Lemma  5    The  height  of  a  node  in  the  free  task  dag  can  never  increase. 

Proof. Only the subsume and collapse operations change the height of a node in the free
task dag. In both cases, the nodes that point to the deleted node n are modified to point
to the outnodes of n rather than to n. Because the dag is acyclic, this cannot increase
the height of a node in the dag. □

Lemma 6 The free task dag for the MRU algorithm is always a tree in which every entry
point n_b is at height at most N + 1 where N is the fork-join nesting level of block b.

Proof.  Proof  by  induction  on  the  level  of  nesting  of  paireJlel  constructs: 
Base  Case  (N=l):  Before  the  first  fork  operation  there  is  a  single  node  n^  in  the  free  task 
dag  that  is  the  entry  point  of  the  current  block.  The  fork  operation  adds  the  task  of  the 
current  block  to  n,.  and  creates  k  entry  point  nodes  n\. . .  n^-one  for  each  block  created 
by  the  fork  operation-that  have  one  outnode  n,  (and  are  thus  at  height  2).  At  the  join 
operation,  the  entry  point  n^  of  the  block  after  the  join  operation  is  created,  amd  the 
outnodes  of  ni  . . .  Uj  become  the  outnodes  of  tij.  Node  tv  is  the  sole  outnode  of  ni . . .  nfc; 
therefore  nj  has  one  outnode,  n^.  After  n\  . .  .Uj  cire  deleted,  the  only  node  that  points 
to  Ur  is  Tij,  and  hence  n_,  will  subsiune  n,  resiilting  in  a  tree  of  height  1.  Any  subsequent 
fork-join  operations  perform  the  saime  trainsformations  to  the  graph. 
Induction  Hypothesis:  Assume  that  the  tree  structure  holds  for  all  POEGs  with  fork-join 
nesting  at  most  i:  prove  that  it  holds  for  all  POEGs  with  nesting  i  +  1.  The  entry  point 
nf  of  the  block  performing  a  fork  operation  of  nesting  i  +  1  is  at  height  at  most  i  +  1.  We 
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then  create  k  new  entry  points  n^  . .  .rik  that  have  a  sole  outnode,  n/  (eind  therefore  eire 
at  height  :  +  2).  There  are  two  cases  that  can  occur  during  task  assignment  of  the  blocks 
bi . .  .bk  created  by  the  fork  operation: 

1.  If  there  eire  enough  tasks  associated  with  n/  to  cissign  61  ..  .bk,  then  n/  remains  the 
outnode  of  ni  . .  .nfc.  After  we  perform  the  join,  the  entry  point  rij  is  the  sole  node 
that  points  to  n/  and  will  subsume  it.  Therefore,  the  height  of  rij  is  equail  to  the 
height  of  Tif  which  is  at  most  t  +  1. 

2. If there are not enough tasks associated with n_f to assign b_1 ... b_k, then after some
block b_j is assigned, node n_f will be collapsed, and all of the nodes that point to it
(namely, n_1 ... n_k) will be modified to point to the outnode of n_f. By the induction
hypothesis, n_f has exactly one outnode that is at height at most i. Path compression
can occur many times during the task assignment process, but in all cases the height
of nodes n_1 ... n_k decreases, and they all have the same outnode. At the join, the
outnode of n_1 ... n_k is made the outnode of n_j, and therefore n_j is at height at most
i + 1.

In both cases the dag is always a tree in which each entry point is at height at most one
more than its level of nesting. □

Theorem 10 The number of tasks in a task assignment performed by the MRU algorithm
is T'.

Proof. For a given POEG G there is a simplified POEG, G', which is G with all
coordination edges removed. The maximum concurrency of G' is equal to the maximum
concurrency of G with all coordination edges removed, namely T_G' = T'_G.

The optimality of the assignment performed by the MRU algorithm for G' is based on
three properties:

1. Every block that can select a task from node n can also select a task from the
outnode of n, but not vice versa.

2. The selection among the tasks in a given free task set does not affect the optimality
of the assignment.

3. The free task dag correctly represents all ancestor tasks for each block b in G' (no
block in G' has any indirect ancestors).

Only two choices are made in performing a task assignment: which free task set F to use,
and which task to use from F. By Lemma 6, every node has a single outnode, and there
is always at most one leaf associated with each block b. Because of property 1, it is always
optimal to assign all of the tasks from a leaf node n in the free task dag before assigning
a task from the outnode of n. Therefore, the first choice optimally selects the appropriate
free task set F. Because of property 2, the choice of task within the free task set F does
not affect the optimality of an assignment. Because of property 3, the task assignment
performed must be optimal and by Theorem 7 uses T_G' = T'_G tasks. Since the assignment
performed for G is identical to the assignment performed for G', the assignment to G must
also use T'_G tasks. □


Theorem 11 The size of the free task dag used by the MRU algorithm is O(T + T').

Proof. There can be at most T leaf nodes since every leaf node is the entry point of
an executing block, and T is the maximum concurrency of the program. Therefore, the
number of nodes and edges in the free task dag is O(T). The size of the free task sets
is O(T') since, by Theorem 10, there are at most T' tasks and therefore at most T'
unassigned tasks. Hence, the total size of the free task dag is bounded by O(T + T'). □

Theorem 12 The average cost per block of performing the MRU algorithm is O(N).

Proof. Consider the operations that may have to be performed for each block. An
assignment and (possibly) a collapse operation are performed when a block is created; a
fork or join and (possibly) a subsume operation are performed when a block terminates.
The assignment, join, and subsume operations each perform O(1) work. Although the
fork operation performs O(W) work (where W is the number of threads created by the
fork), this cost averages O(1) work for each of the W new threads. The collapse
operation performs work linear in the number of nodes that point to the collapsed node.
However, a collapse operation is not performed by every block.

The work performed per block for collapse operations can be calculated by computing
the work performed over the lifetime of each node to update its outnode information when
other nodes are collapsed. This figure is summed over all nodes in the free task dag, F,
and then divided by the number of blocks, B.

The only nodes whose collapse can affect a node n are the nodes on the path from n to
the root of the dag. Whenever the outnode of n is collapsed, O(1) work is done updating
n, and the height of n is decreased by one. By Lemma 6, a node n at level of nesting l is
originally at height at most l + 1. Thus, the total amount of work done updating outnode
information for the entire dag is:

$$O\left(\sum_{i=1}^{F} \mathit{height}(n_i)\right) \le O(FN)$$

(where N is the maximum level of nesting). Since O(F) = O(B), the average amount of
work performed per block is O(N). □
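
The last step of this proof can be written out explicitly. The derivation below is only a restatement under the proof's own assumptions: every node n_i starts at height at most N + 1 (Lemma 6), each unit of height lost costs O(1) work, and the number of nodes F is O(B).

```latex
\[
\frac{1}{B}\sum_{i=1}^{F} \mathit{height}(n_i)
\;\le\; \frac{F\,(N+1)}{B}
\;=\; O(N)
\qquad \text{since } F = O(B).
\]
```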

3.4.3 Modified MRU Task Assignment Algorithm

The MRU assignment algorithm uses T' tasks to perform a valid task assignment. However,
the size of the optimal assignment is equal to the maximum concurrency of the graph,
T. Since T can be less than T', the MRU algorithm is only guaranteed to be optimal for
programs that contain no coordination¹. There is the potential for reducing the size of
the task assignment of programs with coordination.

For example, Figure 3.13 shows an optimal task assignment for a POEG with a coordination
edge. The number of tasks used by the optimal assignment is equal to the maximum
concurrency of four. Because the MRU algorithm disregards the ancestors obtained from
the coordination operations, it must use six tasks to perform a valid assignment.

Figure 3.13: POEG with Coordination

The complexity of performing an optimal task assignment stems from the presence
of coordination. After a block b coordinates, any of the tasks that were last assigned to
either direct or indirect ancestors of b can be validly assigned to b. In order to use tasks
assigned to indirect ancestors, the algorithms for managing the free task dag must be
modified to include information about coordination operations.

¹The simple MRU algorithm is also suboptimal for programs whose thread creation and termination
operations are not parallel constructs.

To do so, the structure of a node in the free task dag is extended. Each node has three
types of outnodes: a tree outnode, direct outnodes and indirect outnodes. The direct
outnodes are outnodes from fork and join operations in parallel constructs; the indirect
outnodes are outnodes obtained from coordination operations and unconstrained fork and
join operations. (If a node is in both the direct and indirect outnode sets of another node, it
is deleted from the indirect set.) The tree outnode is one of the direct outnodes. Direct
and indirect outnodes are always propagated to the leaf nodes so that internal nodes have
only a single tree outnode.

To represent coordination, a node is associated with each coordination point. This
node accumulates information about all sender blocks and is then used to modify the
entry point nodes of all receiver blocks. In particular, the entry point of a receiver block
is modified to have as indirect outnodes the tree, indirect and direct outnodes of the
entry points associated with its indirect parents. Algorithm 3.9 shows the algorithms for
maintaining the free task dag when a coordination operation is performed.

procedure Coordinate-Sender(b, O)
    make and initialize node new
    new.tree := n_b
    new.direct := n_b.direct
    new.indirect := n_b.indirect
    n_O.indirect := n_O.indirect ∪ n_b.indirect ∪ n_b.direct ∪ n_b
    for all n ∈ n_b.direct ∪ n_b.indirect do
        n.innode := n.innode + n_O + new − n_b
    endfor
    n_b.innode := n_O + new
    n_b := new
end procedure

procedure Coordinate-Receiver(b, O)
    n_b.indirect := n_b.indirect ∪ n_O.indirect
    for all n ∈ n_O.indirect do
        n.innode := n.innode + n_b
    endfor
end procedure

Algorithm 3.9: Coordination Operation

In addition, indirect and direct outnodes are propagated to entry point nodes at fork
and join operations. Thus, when a block b performs a fork or join operation O, three steps
are performed: node n_b becomes the tree outnode of n_O, the indirect and direct outnodes
of n_b become the indirect and direct outnodes of n_O, and each child c of n_O transfers the
indirect and direct outnodes of n_O to n_c. Algorithm 3.10 shows the modified algorithms
for maintaining the free task dag.

procedure Fork(b, O, c_1 ... c_w)
    n_O := n_b
    n_O.free-tasks := n_O.free-tasks ∪ task_b
    for all children blocks c_1 ... c_w do
        make and initialize node n_ci
        n_ci.tree := n_O
        n_O.innode := n_O.innode + n_ci
        n_ci.direct := n_O.direct
        n_ci.indirect := n_O.indirect
        for all n ∈ n_O.direct ∪ n_O.indirect do
            n.innode := n.innode − n_b + n_ci
        endfor
    endfor
end procedure

procedure Join(b, O, c)
    if n_O = null node then
        make and initialize node n_O
        n_O.tree := n_b.tree
        n_O.direct := n_O.direct ∪ n_b.direct
        n_c := n_O
    else n_O.direct := n_O.direct ∪ n_b.direct + n_b.tree
    n_O.indirect := n_O.indirect ∪ n_b.indirect
    for all n ∈ n_b.direct ∪ n_b.indirect ∪ n_b.tree do
        n.innode := n.innode + n_O − n_b
    endfor
    n_O.free-tasks := n_O.free-tasks ∪ task_b
    remove node n_b
end procedure

Algorithm 3.10: Modified Free Task Maintenance

To illustrate free dag management in the modified MRU algorithm, Figure 3.14 shows
two stages in the task assignment to the POEG in Figure 3.13. (Indirect outnodes are
denoted by dashed lines.) Figure 3.14a shows the free task dag immediately before the
coordination operation. At this point in the execution, tasks 3 and 4 are available for
reassignment to block 1_4 and none are available to block 2_1. Figure 3.14b shows the changes
made to the free task dag after the coordination between blocks 1_4 and 2_1. Free tasks 3
and 4 are now available to both blocks 1_5 and 2_2.

When indirect ancestors are included, the free task dag correctly represents all of the
free tasks available for assignment to currently executing blocks. However, the free task
dag is now truly a dag rather than restricted to being a tree, as can be seen in Figure
3.14. This means that there may be a choice of nodes when assigning a task to a given
block. The goal of on-line task assignment protocols is to select the "correct" node.

Figure 3.14: Stages in Free Task Dag for POEG in Figure 3.13 (panels (a) and (b))

A variety of different protocols are possible for assigning tasks. To preserve the MRU
heuristic, all assignment protocols are based on a topological traversal of the dag using
time stamps that are associated with nodes. (The time stamp of a node n is one more
than the highest time stamp of the outnodes of n.) A naive task assignment protocol does
not differentiate between direct and indirect outnodes when performing the topological
traversal; it simply picks the node with the highest time stamp. Unfortunately, the size of
the task assignment performed by this protocol is, in the worst case, linear in the number
of blocks in the POEG (i.e., O(B)). This can be much larger than the bound of T' for
the simple MRU algorithm.
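
The time stamp rule in the parenthetical above can be sketched in a few lines of Python. This is an illustration only: the dictionary representation of the dag and the name `time_stamps` are assumptions, and giving a node with no outnodes the time stamp 0 is a base case chosen here, not one stated in the text.

```python
def time_stamps(outnodes):
    """Compute a time stamp for every node of a dag.

    `outnodes` maps each node name to the list of nodes it points to.
    Rule from the text: stamp(n) = 1 + highest stamp among the outnodes of n.
    Assumption (not in the text): a node without outnodes gets stamp 0.
    """
    stamps = {}

    def stamp(n):
        if n not in stamps:
            outs = outnodes.get(n, [])
            stamps[n] = (1 + max(stamp(o) for o in outs)) if outs else 0
        return stamps[n]

    for n in outnodes:
        stamp(n)
    return stamps


# A leaf that points at two internal nodes and directly at the root.
dag = {"root": [], "a": ["root"], "b": ["root"], "leaf": ["a", "b", "root"]}
stamps = time_stamps(dag)
```

Because a node's stamp always exceeds the stamps of its outnodes, visiting candidate nodes in decreasing stamp order is one way to realise the topological traversal the protocols rely on.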

Figure 3.15 shows one such worst case assignment. In this very unlucky execution,
tasks that are assigned to the left parallel construct at phase i - 1 are always reassigned
to the right parallel construct at phase i. Therefore, new tasks have to be created for the
left parallel construct for every phase i. The number of tasks used in the assignment is
10 (approximately T + y) as compared to an optimal number of 6.

A different assignment protocol bounds the size of the task assignment to T' by giving
preference to the tree or direct outnode with the highest time stamp. A task is assigned
from an indirect outnode (the one with the highest time stamp) only if the last remaining
direct node has an empty free task set. Algorithm 3.11 shows the code for this task
assignment protocol.

Theorem 13 proves that in the worst case this protocol will use as many tasks as the
simple MRU algorithm.

Theorem  13:   The  number  of  tasks  in  a  task  assignment  performed  by  the 
modified  MRU  algorithm  is  less  than  or  equal  to  T'. 

Figure 3.15: Pathological Assignment

However, the modified MRU algorithm should use fewer tasks than the simple MRU
algorithm in practice. One open problem in this area is an on-line task assignment algorithm
that bounds the size of the assignment to a function of the optimal size.

To illustrate the benefit gained from the modified MRU task assignment algorithm,
consider the POEG in Figure 3.14. When block 2_2 performs the fork operation, the
children of the fork use tasks 3 and 4 before creating a new task. In this case the modified
MRU algorithm performs an optimal assignment for the POEG in Figure 3.13. In contrast,
the simple MRU algorithm is forced to generate two new tasks, 5 and 6, for the blocks
assigned 3_2 and 4_2 in the optimal assignment. Therefore, more tasks are required for the
simple MRU algorithm (six tasks as compared to four) than for the modified MRU algorithm.

procedure Assign-Task(n_b, task)
    f := n_b
    if f.free-tasks = ∅ then
        f := node in {n_b.tree ∪ n_b.direct} with highest time
        if f.free-tasks = ∅ then
            f := node in n_b.indirect with highest time
        endif
    endif
    if f.free-tasks ≠ ∅ then
        pick task from f.free-tasks
    else task := new task
end procedure

Algorithm 3.11: Modified Task Assignment
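
As a rough illustration of the preference order in Algorithm 3.11, the following Python sketch tries the block's own entry point first, then the tree or direct outnode with the highest time stamp, and only then an indirect outnode. The `Node` class, its field names, and the `new_task` callback are assumptions introduced for this sketch; they are not the thesis's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical free-task-dag node used only for this sketch."""
    time: int
    free_tasks: list = field(default_factory=list)
    tree: "Node | None" = None
    direct: list = field(default_factory=list)
    indirect: list = field(default_factory=list)


def assign_task(nb, new_task):
    """Pick a task for the block whose entry point is `nb`.

    Mirrors Algorithm 3.11: own node, then the tree/direct outnode with the
    highest time stamp, then the indirect outnode with the highest time stamp;
    a new task is created only if the node finally chosen has no free task.
    """
    f = nb
    if not f.free_tasks:
        near = ([nb.tree] if nb.tree is not None else []) + nb.direct
        if near:
            f = max(near, key=lambda n: n.time)
        if not f.free_tasks and nb.indirect:
            f = max(nb.indirect, key=lambda n: n.time)
    if f.free_tasks:
        return f.free_tasks.pop()
    return new_task()


# The direct ancestor has no free task, so the indirect outnode supplies one.
side = Node(time=1, free_tasks=[7])
entry = Node(time=2, tree=Node(time=1), indirect=[side])
first = assign_task(entry, lambda: "new")    # reuses task 7
second = assign_task(entry, lambda: "new")   # nothing left: a new task
```

Like the pseudocode, the sketch inspects only the single highest-stamped candidate at each level rather than searching every outnode for a free task.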

3.4.4     Complexity  of  the  Modified  MRU  Algorithm 

Unfortunately, a relatively large amount of work is required in the worst case to maintain
the extended free task dag. Thus, while the size of its assignment is potentially smaller,
the modified MRU algorithm is more expensive than the simple MRU algorithm. This is
shown by the following theorems:

Theorem 14: The size of the free task dag used by the modified MRU
algorithm is O((T + C)A).

Theorem 15: The average cost per block of performing the modified MRU
algorithm is O(A + C).

where A is the size of the task assignment (which is bounded by T' by Theorem 13) and
C is the number of outstanding asynchronous coordination operations.

These are theoretical worst case upper bounds and are required only for certain
pathological POEGs. In general, the size of the free task dag, and hence the cost for task
assignment, is much less than these bounds. In fact, the complexity of the free task dag
grows as a function of the amount of coordination. For programs with no coordination,
the simple and modified MRU algorithms have equal space and time complexities.

Theorem 13 The number of tasks in a task assignment performed by the modified MRU
algorithm is less than or equal to T'.

Proof. Each block b_j that is assigned from an indirect outnode n can force at most
one block b_i to create a new task. However, b_j will use a task from an indirect outnode
only if no free task was last assigned to a direct ancestor of b_j. If b_j did not use an
indirect outnode, as in the simple MRU algorithm, it would be forced to create a new
task immediately. It follows that the number of tasks used by the modified MRU algorithm
is at most the number of tasks used by the simple MRU algorithm, which is T'. □

Theorem 14 The size of the free task dag used by the modified MRU algorithm is
O((T + C)A).

Proof. There can be at most A internal nodes since every such node must have at least
one free task associated with it. There are still at most T entry point nodes and C
outstanding asynchronous coordination operations. By construction, the internal nodes
of the free task dag always form a tree. Therefore, only leaf nodes have more than one out
edge. The total number of nodes in the free task dag is O(A + C), the number of edges is
O((T + C)A), and hence the total size of the free task dag is bounded by O((T + C)A). □

Theorem 15 The average cost per block of performing the modified MRU algorithm is
O(A + C).

Proof. As in the proof of Theorem 12, we consider the operations that must be performed
for each block: an assignment or coordination and (possibly) a collapse operation when
a block is created, and a fork, join or coordination and (possibly) a subsume operation
when a block is terminated. The assignment, fork and join operations require O(A) work.
The coordination operation also requires O(A) work; a global outnode set is maintained
for every coordination point to which every sender block adds its outnodes, and that is
copied to each receiver block's entry point after all coordinating blocks have updated it.
Since the internal structure of the free task dag is a tree, all collapsed and subsumed
nodes n have a single outnode. Therefore, at most O(T + C) work is required to update
the tree outnode and direct and indirect outnode lists in which n appears. □

3.5     Extensions  to  Task  Recycling 

There are two primary ways in which the techniques developed for the task recycling
algorithm can be applied to other issues in debugging parallel programs. One method is
to use the set of anomalies reported to increase the reliability of trace-and-replay based
debuggers. The second is to extend event-based debuggers to consider both operational
and causality ordering constraints when detecting general race conditions.

3.5.1 Trace-and-Replay

Trace-and-replay is one technique of debugging parallel programs. In sequential programs,
two execution instances of the same program on the same input vector are guaranteed to
be identical. (Recall that we have a very liberal definition of an input vector.) Because of
the presence of race conditions, this property does not hold for parallel programs. The goal
of trace-and-replay debuggers is to store enough information about the race conditions
in a specific execution instance E_t of a parallel program and input vector pair (P, I) to
guarantee that a subsequent re-execution E_r of P on I is identical to E_t. Reproducibility
is achieved by maintaining enough information about the execution state when an event e
was performed in E_t so that E_t can be reliably reproduced by forcing some set of events
to be executed before every e in E_r.

Instant Replay is a trace-and-replay debugger developed by Fowler, LeBlanc and
Mellor-Crummey [LMC87,FLMC88]. In Instant Replay, information about the execution
instance is recorded in the process history tape associated with each process. Every
time a coordination operation is performed, a record is added to the history tape of the
coordinating process. Process history tapes are used during the replay phase to control the
execution of the program. Specifically, an execution instance is reproduced by constraining
all coordination operations to occur in the same order in a subsequent re-execution.

The primary drawback of the Instant Replay system is that it considers only race
conditions in synchronization and coordination operations. (In other words, those races
that lead to POEG nondeterminism.) However, access anomalies are another type of race
condition that can affect the execution of the program and hence must be traced and
replayed in order to ensure reproducibility.

Reproducibility of access to a shared variable X can be guaranteed by forcing the same
set of conflicting accesses to X to appear in both E_t and E_r. However, the technique
for monitoring shared variable accesses should be somewhat different from the approach
used for coordination operations, since it is undesirable to trace every access to every
shared variable when only the execution of access anomalies must be constrained. Instead,
information about access anomalies is stored in a variable history tape associated with each
process. Each process P also has a step counter that is incremented every time P accesses
a monitored variable. Each monitored variable X has an associated version number that
is incremented every time X is written.

During the trace execution of the program, E_t, anomalies are detected using the
task recycling algorithm. A record, which consists of a step count, version number and
auxiliary field, is written to a variable history tape whenever the variable is accessed in an
unsafe manner. Because only information about access anomalies is written to the tape,
the size of the tape will generally be much smaller than a trace of all accesses to shared
variables.

Specifically, when a process P performs a read or write operation that conflicts with
a preceding write operation, the version number and current step of P are written to the
variable access tape of P. When a process P performs a write operation that conflicts
with a set of preceding read operations, the version number and current step of P, as
well as the number of conflicting reads, are recorded in the variable access tape of P. In
addition, the version number and step of each conflicting read event are written to the
variable history tape of the process that performed the read event. The algorithms for
recording the trace information are shown in Algorithm 3.12.

procedure Trace-Read-Event(b, X)
    step(b) := step(b) + 1
    if IsConcurrent(b, Writer(X)) then
        write [step(b), version(X), NULL] to tape(b)
    endif
    Subtract-Read(b, X)
end procedure

procedure Trace-Write-Event(b, X)
    step(b) := step(b) + 1
    count := 0
    for all a in Reader(X) do
        if IsConcurrent(b, a) then
            write [step(a), version(X), -1] to tape(a)
            count := count + 1
        endif
    endfor
    if IsConcurrent(b, Writer(X)) or count ≠ 0 then
        write [step(b), version(X), count] to tape(b)
    endif
    version(X) := version(X) + 1
    Subtract-Write(b, X)
end procedure

Algorithm 3.12: Tracing Read and Write Events
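
The record layout produced by Algorithm 3.12 may be easier to follow in executable form. The Python sketch below is a simplified, single-threaded model: the `Tracer` class, its dictionaries, and the caller-supplied `is_concurrent` predicate are assumptions made for illustration, and the reader and writer bookkeeping here is a stand-in for the Subtract-Read and Subtract-Write routines of task recycling, not a reimplementation of them.

```python
from collections import defaultdict


class Tracer:
    """Toy model of trace-phase recording; records are (step, version, aux)."""

    def __init__(self, is_concurrent):
        self.is_concurrent = is_concurrent    # hypothetical predicate (b, a) -> bool
        self.step = defaultdict(int)          # per-block step counter
        self.version = defaultdict(int)       # per-variable version number
        self.tape = defaultdict(list)         # per-block variable history tape
        self.writer = {}                      # variable -> last writing block
        self.readers = defaultdict(set)       # variable -> readers of current version

    def trace_read(self, b, x):
        self.step[b] += 1
        w = self.writer.get(x)
        if w is not None and self.is_concurrent(b, w):
            self.tape[b].append((self.step[b], self.version[x], None))
        self.readers[x].add(b)                # stand-in for Subtract-Read

    def trace_write(self, b, x):
        self.step[b] += 1
        count = 0
        for a in self.readers[x]:
            if self.is_concurrent(b, a):
                # Record goes on the *reader's* tape, flagged with aux = -1.
                self.tape[a].append((self.step[a], self.version[x], -1))
                count += 1
        w = self.writer.get(x)
        if (w is not None and self.is_concurrent(b, w)) or count != 0:
            self.tape[b].append((self.step[b], self.version[x], count))
        self.version[x] += 1
        self.writer[x] = b                    # stand-in for Subtract-Write
        self.readers[x] = set()


# Two blocks "p" and "q" that are concurrent with each other.
t = Tracer(lambda b, a: b != a)
t.trace_write("p", "X")    # no conflict yet, nothing recorded
t.trace_read("q", "X")     # read concurrent with p's write
t.trace_write("p", "X")    # write concurrent with q's read
```

In this run block q ends up with two records for its single read, one for each conflicting write, which is the situation the sorting and deduplication step before replay has to handle.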

Since the Trace-Write-Event routine writes information to the other processes' variable
history tapes, the records in a variable history tape can be out of order. Before the replay
phase each variable history tape is sorted by the step field. A read event that is involved
in access anomalies with two write operations will have two identical tape records. In this
case, one of the redundant entries is deleted.
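
The preparation step described above amounts to a sort followed by duplicate removal. A minimal Python sketch, assuming each record is a `(step, version, aux)` tuple (a representation chosen here for illustration):

```python
def prepare_tape(records):
    """Sort tape records by their step field and drop exact duplicates.

    `records` is an iterable of (step, version, aux) tuples; the tuple layout
    is an assumption of this sketch.
    """
    ordered = sorted(records, key=lambda rec: rec[0])  # stable sort on step
    seen = set()
    result = []
    for rec in ordered:
        if rec not in seen:        # keep the first copy of identical records
            seen.add(rec)
            result.append(rec)
    return result


tape = prepare_tape([(3, 2, -1), (1, 1, None), (3, 2, -1), (2, 1, 1)])
```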

During the replay phase, whenever a process P accesses a monitored variable, the
current step of P is incremented and compared against the step count at the head of the
variable history tape of P. If the current step is less than the tape step, P can continue
execution. Otherwise, the current access was involved in an anomaly and must wait for
the state of the execution to be the same as in the prior execution.

In particular, P cannot continue execution until the version of X in E_r is the same
as the version in E_t. If the current access is a write, the number of conflicting read
operations that happened before the current write operation must also be the same in
E_r. This second check guarantees that all concurrent read operations that are supposed
to read the previous version of X are completed before the new version of X is written.
The algorithms for the replay phase are shown in Algorithm 3.13.

procedure Replay-Read-Event(b, X)
begin
    step(b) := step(b) + 1
    if step(b) = step(tape(b)) then
        while version(X) ≠ version(tape(b)) do loop
        read X
        if aux(tape(b)) = -1 then count(X) := count(X) + 1
        move to next square on tape(b)
    else read X
end procedure

procedure Replay-Write-Event(b, X)
begin
    step(b) := step(b) + 1
    if step(b) = step(tape(b)) then
        while version(X) ≠ version(tape(b)) and count(X) ≠ aux(tape(b)) do loop
        write X
        count(X) := 0
        version(X) := version(X) + 1
        move to next square on tape(b)
    else write X
end procedure

Algorithm 3.13: Replay Read and Write Events
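
The waiting condition of the replay phase can be isolated as a pure predicate. The Python sketch below is an assumption-laden simplification: the function name `replay_ready` and the tuple layout `(step, version, aux)` are invented here, and the function follows the prose description (a write proceeds only once both the version and the count of conflicting reads match the tape) rather than transcribing the loop condition of the pseudocode.

```python
def replay_ready(kind, step, head, version_x, count_x):
    """Return True if an access may proceed now during replay.

    kind      -- "read" or "write"
    step      -- the accessing block's step counter, already incremented
    head      -- (step, version, aux) record at the head of its tape, or None
    version_x -- current version of the variable in the replay execution
    count_x   -- conflicting reads completed since the last write in replay
    """
    if head is None or step != head[0]:
        return True                       # access was not part of an anomaly
    _, tape_version, aux = head
    if version_x != tape_version:
        return False                      # the expected write has not happened yet
    if kind == "write" and count_x != aux:
        return False                      # some conflicting reads still outstanding
    return True


# A read recorded against version 1 must wait while the variable is at version 0.
waiting = replay_ready("read", 2, (2, 1, None), 0, 0)
released = replay_ready("read", 2, (2, 1, None), 1, 0)
```

A real replay runtime would evaluate such a predicate inside a wait loop or condition variable; the sketch leaves the blocking mechanism out.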

Theorem 16 proves that this system is sufficient for tracing and replaying accesses to
shared variables.

Theorem 16: Every read event gets the same value in trace execution E_t
and replay execution E_r.

The amount of replay control data and overhead is proportional to the number of access
anomalies, rather than the number of accesses to shared variables. In addition, the step
count must be incremented for every access to every shared variable that has at least one
access anomaly.

Lemma 7 All write events happen in the same order in trace execution E_t and replay
execution E_r.

Proof. Suppose, for the sake of contradiction, that write event W_i to shared variable X
happened before write W_j to X in E_t and after W_j in E_r. Let W = W_i, W_{i+1}, ..., W_j be
the sequence of write operations that occurred in E_t. Consider a write operation W_k ∈
W. If W_{k+1} was performed by a descendant of W_k in E_t, then W_{k+1} must also happen
after W_k in E_r. If W_k was concurrent with W_{k+1} in E_t, then there is a record in the
variable history tape of the process executing W_{k+1} that forces it to execute after W_k in
E_r. Therefore, for all i ≤ k < j, write W_{k+1} must happen after W_k in E_r, and hence W_j
must happen after W_i in E_r. □

Lemma 8 All read events happen after the same write events in trace execution E_t and
replay execution E_r.

Proof. Suppose, for the sake of contradiction, that a read event r of shared variable X
happened immediately after write event w_i of X (i.e., before w_{i+1}) in E_t, but before w_i
in E_r. Read event r must be concurrent with w_i. Therefore r was detected as an access
anomaly with respect to w_i in E_t and there is a record in the variable access tape of the
process performing r that forces r to execute only after w_i is performed. □

Lemma 9 All read events happen before the same write events in trace execution E_t and
replay execution E_r.

Proof. Suppose, for the sake of contradiction, that a read event r of shared variable X
happened immediately before w_i of X (i.e., after w_{i-1}) in E_t, but after w_i in E_r. Read
event r must be concurrent with w_i. By Lemma 8, no read event that happened after w_i
in E_t will happen before w_i in E_r. The last read event r' performed by a descendant of
r that occurred before w_i was detected as an access anomaly with respect to w_i in E_t.
Therefore, r' is guaranteed to happen before w_i in E_r. Otherwise, the count field will not
be large enough to let w_i proceed. Thus, w_i will not continue in E_r until r' (and therefore
its ancestor r, if r ≠ r') has executed. □

Theorem 16 Every read event gets the same value in trace execution E_t and replay
execution E_r.

Proof. This follows directly from Lemma 7, Lemma 8 and Lemma 9. □

3.5.2      General  Race  Conditions 

Event-based debugging is a technique for detecting race conditions of general events
in parallel and distributed programs. Event-based debugging was first proposed by
Bruegge [Bru85,BH83] and has been used as the basis of many later debugging systems
[Bat88,BW83,HK88,Sto88]. (Access anomalies are a very specific type of race condition in
which the only possible events are read and write operations.) To detect general race
conditions, the set of events to monitor and the notion of conflicting events must be explicitly
defined by the programmer. (In access anomaly detection systems, the set of events and
notion of conflicting events are defined by Bernstein's Conditions.)

Most event-based debugging systems use mechanisms based on path expressions, a
protocol proposed by Campbell and Habermann [CH73,Hab75], for specifying the valid
ordering of events. The syntax and semantics of path expressions are shown in Figure
3.16. An execution instance is valid with respect to a path expression P if and only if it
corresponds to a string in the language defined by P. Event-based debuggers use a set of
path expressions to fully specify the desired ordering constraints.

pe := event
    | pe_i + pe_j    selection - either pe_i or pe_j may execute, but not both
    | pe_i ; pe_j    sequencing - pe_i must execute before pe_j
    | pe*            repetition - pe may execute zero or more times

Figure 3.16: Syntax and Semantics of Path Expressions

Event-based debuggers guarantee that the following operational constraint on the
execution order of events is maintained during the execution of a parallel program:

Operational Constraint:   Event e is valid if and only if e is a valid next
symbol with respect to all path expressions in which e appears.

In other words, they check that the sequence of operations performed is valid with respect
to all path expression constraints. A weakness of event-based debugging systems is that
no attempt is made to verify that the operational constraint is enforced by the semantics
of the parallel program.*

To illustrate this problem, consider a parallel program in which there is a producer
and a consumer with a one-unit buffer. The consumer may proceed only after something
has been produced, and the producer may proceed only if its previous output has been
consumed. This behavior is checked by the path expression below:

path  produce;  consume  end 

where  path  . . .  end  is  a  repetition  operator.  The  following  execution  sequence, 

produce,  consume,  produce,  consume,  ••• 

meets the operational constraints imposed by the above path expression. Therefore, this
execution would be considered valid by an event-based debugging system. If the first
produce event was not causally ordered with the consume event, however, a race condition
exists which was not detected because the two events simply happened to occur in the
correct order.

*The system of Hseush and Kaiser addresses this problem by requiring exact ordering relationships
[HK88,HK90].

The notions used in task recycling can be incorporated into event-based debugging
systems to improve race condition detection. Specifically, access histories and a POEG
are used to verify that the following causality constraint is also met.

Causality Constraint:  Event e must not be concurrent with any event that
is required to precede event e by the semantics of the path expressions.

Thus, race conditions will be detected regardless of whether or not the events happen to
be performed in the correct order in a given execution instance. By requiring that both
operational and causality constraints are met, race condition detection is strengthened
in these systems.

The approach of checking causality ordering is fairly independent of the mechanism
for specifying and checking operational ordering constraints. Whatever rules are applied
to verify operational ordering can be modified to verify causality ordering as long as we
can associate an access history with each operational ordering constraint. The remainder
of this section describes the algorithm for checking causality ordering for path expressions.

Because a path expression is a regular expression, it has a corresponding deterministic
finite automaton. Each event in a path expression is represented by a transition and each
";" sequencing operator by a state. A state s becomes selected when an event associated
with an in-transition of s executes; s becomes unselected when an event associated with an
out-transition of s executes. All of the out-transitions of the selected state are permissible.
An event e can execute if and only if there is a permissible transition labeled e in all of
the path expressions in which e appears.

As in access anomaly detection, access histories contain the identifiers of blocks and
are checked and unneeded entries subtracted whenever an event is performed. In
event-based debugging, the access history is distributed by associating entries with states in the
automata. The access history entry of state s contains the identifier of the block which
last selected s. When a block b executes an event e, block b checks if it is concurrent
with any block in the access history entry associated with the head states of each of its
permissible transitions. If so, a race condition exists. Block b then adds itself to the tail
states' access histories. Algorithm 3.14 shows the algorithm for verifying valid operation
ordering and checking causality constraints.

procedure Check-Event(e, b)
    if |permissible(e)| < paths-in(e) then
        report Operational Constraint Error
    else
        for all t in permissible-trans(e) do
            if IsConcurrent(b, access-history(t.head)) then
                report Causality Constraint Error
            endif
            for all transitions s in out-trans(t.head) do
                delete s from permissible(s.event)
            endfor
            for all transitions s in out-trans(t.tail) do
                add s to permissible(s.event)
            endfor
            access-history(t.tail) := b
        endfor
    endif
end procedure

Algorithm 3.14: Check Operational and Causality Constraints
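
The producer and consumer example can be used to illustrate the two checks. The sketch
below is a single-path simplification of Algorithm 3.14 written in Python, not the
implementation described in this thesis. The class and function names are illustrative,
and the happened-before relation is assumed to be supplied as a set of (earlier, later)
block pairs rather than derived from a POEG.

```python
class Transition:
    """One event of a path expression: leaves state `head`, enters state `tail`."""
    def __init__(self, event, head, tail):
        self.event, self.head, self.tail = event, head, tail


class PathChecker:
    """Single-path simplification of Check-Event (all names are illustrative)."""

    def __init__(self, transitions, start, happened_before):
        self.transitions = transitions
        self.selected = start                    # currently selected state
        self.happened_before = happened_before   # assumed given: (earlier, later) pairs
        self.access_history = {}                 # state -> block that last selected it

    def is_concurrent(self, a, b):
        hb = self.happened_before
        return a != b and (a, b) not in hb and (b, a) not in hb

    def check_event(self, event, block):
        # Operational constraint: the event must label an out-transition
        # of the currently selected state.
        permissible = [t for t in self.transitions
                       if t.event == event and t.head == self.selected]
        if not permissible:
            return "operational constraint error"
        t = permissible[0]
        # Causality constraint: the block that last selected the head state
        # must be causally ordered with the block performing this event.
        previous = self.access_history.get(t.head)
        racy = previous is not None and self.is_concurrent(block, previous)
        self.selected = t.tail
        self.access_history[t.tail] = block
        return "causality constraint error" if racy else "ok"


# path produce; consume end   (s0 --produce--> s1 --consume--> s0)
PATH = [Transition("produce", "s0", "s1"), Transition("consume", "s1", "s0")]


def run(events, happened_before):
    """Check a list of (event, block) pairs against PATH."""
    checker = PathChecker(PATH, "s0", happened_before)
    return [checker.check_event(e, b) for e, b in events]
```

With the sequence produce (block b1), consume (block b2), the operational check passes
in both cases, but the causality check reports an error when no ordering between b1 and
b2 is supplied.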
Chapter  4

Empirical  Measurements  of  Task  Recycling


Chapter 3 presented the task recycling algorithm for detecting access anomalies in parallel
programs. We can analytically parameterize the overhead associated with task recycling.
For instance, the cost incurred whenever a monitored variable X is written is proportional
to the number of concurrent blocks that have read X; in the worst case, this cost is
equal to the maximum concurrency of the program. Similar bounds exist for maintaining
concurrency information. General discussions of algorithmic overhead, however, give little
insight into the actual costs incurred when detecting anomalies in parallel programs.

To better evaluate the task recycling algorithm, as well as the general approach of
on-the-fly detection of access anomalies, the task recycling algorithm was implemented for
a parallel Fortran environment on the NYU Ultracomputer. In order to obtain relative
performance measurements, an alternative on-the-fly technique known as English-Hebrew
labeling [NR88a,NR88b] was also implemented. Implementation of these two algorithms
is discussed in Section 4.1.

Measurements  were  gathered  for  each  of  the  following  benchmark  parallel  programs: 

Triso - Solves a sparse triangular linear system of equations using wavefront
parallelism

Finite - Solves a linear system using finite element methods

Simple  -  Solves  partial  differential  equations  for  hydrodynamics  and  heat 
conduction 

Polymer  -  Performs  molecular  dynamic  calculations  of  polymer  systems 

Section 4.2 presents information about the parallel structure and access patterns of the
benchmark programs. The behavior seen in the benchmark programs justifies design
decisions made in the development of the task recycling approach and inspired several
optimizations.

Section 4.3 discusses the overhead (measured in space, elapsed running time and CPU
execution time) for monitoring parallel programs using the task recycling and
English-Hebrew labeling techniques.

The  experimental  data  indicate  four  important  results: 

1. The benchmark programs use data partitioning so extensively that over 80% of all
variables never have more than two concurrent readers. Therefore, the size of an
access history is generally very small and independent of the degree of parallelism
of the program.

2. For the benchmark programs, monitoring entails a 150% to 350% slowdown. Although
this cost is high, it is not unreasonable during the debugging phase of program
development.

3. Because of its efficient concurrency information management, English-Hebrew labeling
performs substantially better than unoptimized task recycling on programs with
frequent parallel operations and relatively few accesses to shared variables. However,
optimizations can be used to reduce the costs incurred by task recycling to be
proportional to the complexity of the concurrency structure and thus comparable to
the cost incurred by English-Hebrew labeling.

4. As reader sets or concurrency lists increase in size, the high cost of performing
concurrency verification in English-Hebrew labeling outweighs the benefits of its
efficient concurrency information maintenance.

If the benchmark programs are indicative of a wide class of parallel programs, the task
recycling algorithm is an important improvement over existing technology.

4.1      Implementation 

Task recycling and English-Hebrew labeling have been implemented for the parallel
Fortran environment on the NYU Ultracomputer [Ber88,Got88]. The following sections
describe the system for instrumenting parallel programs and the implementation details for
the English-Hebrew labeling and task recycling techniques. Because of the fairly simple
concurrency structure of the benchmark Fortran programs, several additional
implementation optimizations were possible and are described below.

4.1.1      Instrumentation  System 

The version of parallel Fortran available on the Ultracomputer is based on Fortran-77.
It supports doall and parallel case constructs for creating parallel threads; barriers and
critical sections are the primary forms of coordination. Figure 4.1 shows a simple parallel
Fortran program. All shared variables are explicitly defined by specifying a common block
to be SHARED. The DOALL construct creates three concurrent iterates.

PROGRAM EXMPLE
SHARED /GLOBAL/ A(4), X
INTEGER*4 I, J
X = 0
DOALL I = 1,3,1
   A(I) = I*I
   J = X + A(I+1)
ENDALL
END

Figure 4.1: Example of Fortran Code

A simple front-end preprocessor is used to instrument a parallel program. The
preprocessor approach makes the system relatively portable to other parallel Fortran systems
and simplified the implementation. However, the resulting monitoring efficiency could
be improved. A more complex preprocessor or compiler-based front-end can use static
analysis or user directives to reduce the monitoring cost in several ways. First, only those
accesses flagged as potential access anomalies by static analysis need to be monitored
during run-time. Second, monitoring code could be moved out of a loop body as long as
there are no thread creation, termination or coordination operations within the loop. In
addition, concurrency information has to be maintained only in portions of the program
where variable access monitoring code is present. In this implementation, every access to
every shared variable and every parallel operation is monitored.

The preprocessor allocates storage for access histories and concurrency state
information. Every monitored variable has a "mirror" variable that contains its access history.
The name of the access history is simply the name of the variable suffixed by a "%". Thus,
when a block reads variable A[i], the access history in A%[i] is checked for anomalies and
then updated. Each entry in an access history contains a task identifier or a pointer to the
English-Hebrew label. It also contains the line number and a pointer to the function name
where the last access was performed. This information enables us to identify the locations
in the program code where access anomalies occur and produce useful error messages.

Due to memory constraints on the Ultracomputer, the reader sets in the access histories
are of fixed length. In particular, the benchmark programs were monitored using reader
sets with one and two entries. While this limitation may result in undetected anomalies,
results discussed in Section 4.2.2 indicate that anomalies will be missed only rarely. In
fact, for half of the benchmark programs a reader set of size two is sufficient to guarantee
that an anomaly is detected for every variable that is accessed in an unsafe manner.

The preprocessor inserts calls to library routines for maintaining access histories
whenever a shared variable is accessed and for updating concurrency information at every
thread creation and termination point. The routines for checking access histories and
performing subtraction on access histories are written in assembler for reasons of efficiency.
The task recycling and English-Hebrew labeling access anomaly detection algorithms are
implemented by two library packages written in C. The instrumented program is linked
to a separate library of instrumented coordination routines (namely, barrier and critical
section coordination) that update concurrency information at coordination points.

Figure 4.2 shows the code generated by the preprocessor for the program in Figure 4.1.
The programmer uses a MONITOR directive to specify which shared variables to monitor.

PROGRAM EXMPLE
SHARED /GLOBAL/ A(4), X
shared /global%/ A%(9,4), X%(9)
INTEGER*4 I
integer*4 A%, X%
common /myfunct/ myfun
character*16 myfun
call initds()
call setstr(myfun, 'exmple')
call writxx(X%,5)
X = 0
call mnfrkp()
DOALL I = 1,3,1
   call mnfrkc()
   call writxx(A(1,i)%,7)
   A(I) = I*I
   call readxx(X%,8)
   call readxx(A(1,i+1)%,8)
   J = X + A(I+1)
   call mnjonc()
ENDALL
call mnjonp()
END

Figure 4.2: Example of Instrumented Fortran Code

In Figure 4.2, the GLOBAL common block was monitored. Some of the inefficiencies of
the preprocessor approach can be seen in that variable X is monitored during a portion
of sequential execution. This unnecessary code would not be added if a front-end system
were used to first locate potential access anomalies.

The following is the anomaly report for one possible execution instance of the program
in Figure 4.2:

Write-Read anomaly for A(3):      in exmple.7 and exmple.8
Read-Write anomaly for A(2):      in exmple.8 and exmple.7

In this execution instance, iterate 2 completed before iterates 1 and 3 began executing.
Iterate 2 added a read event to A%(3) and a write event to A%(2). When iterate 3 executed,
it detected that its write of A(3) on line 7 conflicted with a prior read of A(3). When
iterate 1 executed, it detected that its read of A(2) on line 8 conflicted with a prior write.
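
The report for this execution instance can be reproduced with a small simulation. The
Python sketch below is only an illustration under simplifying assumptions: the three
iterates are treated as mutually concurrent, they run one at a time in a given order, and
each access history is a plain dictionary rather than the mirror variables described above.
The function name and the data layout are hypothetical. Write-write conflicts are omitted
because the example program has none.

```python
def simulate(order):
    """Run the three DOALL iterates of Figure 4.1 one at a time, in `order`.

    Assumption: all iterates are mutually concurrent, so an earlier access by
    a different iterate conflicts whenever one of the two accesses is a write.
    """
    history = {}   # variable -> {"writer": (iterate, line) or None, "readers": [...]}
    reports = []

    def entry(var):
        return history.setdefault(var, {"writer": None, "readers": []})

    def write(var, iterate, line):
        h = entry(var)
        for other, other_line in h["readers"]:
            if other != iterate:
                reports.append(f"Write-Read anomaly for {var}: "
                               f"in exmple.{line} and exmple.{other_line}")
        h["writer"] = (iterate, line)

    def read(var, iterate, line):
        h = entry(var)
        if h["writer"] is not None and h["writer"][0] != iterate:
            reports.append(f"Read-Write anomaly for {var}: "
                           f"in exmple.{line} and exmple.{h['writer'][1]}")
        h["readers"].append((iterate, line))

    for i in order:
        write(f"A({i})", i, 7)        # line 7: A(I) = I*I
        read("X", i, 8)               # line 8: J = X + A(I+1)
        read(f"A({i + 1})", i, 8)
    return reports
```

Calling simulate([2, 3, 1]) yields the two anomalies listed above; a different order, such
as simulate([1, 2, 3]), reports the same conflicting pairs of lines but with the roles of
the detecting iterates changed.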

4.1.2      English-Hebrew  Labeling

English-Hebrew labeling is an on-the-fly anomaly detection scheme developed by Nudler
and Rudolph [NR88a,NR88b]. In English-Hebrew labeling, the direct ancestor relationship
(that is, the partial ordering introduced by parallel constructs) is encoded in a tag associated
with each block. A tag consists of a pair of labels: an English label E and a Hebrew
label H. Conceptually, the English label is produced by performing a left-to-right preorder
numbering of the POEG; each block is assigned a number less than all of the numbers
assigned to its children and its siblings to the right. The Hebrew label is produced by
performing a right-to-left numbering.

Since the labels must be generated on-line, a complete traversal of the POEG cannot
be performed. Therefore, a label is a string of numbers and labels are lexicographically
ordered. The children blocks c1 ... cm of a fork vertex with parent block p are assigned
English and Hebrew labels as follows:

fork:         E(tag(ci)) := E(tag(p)) | i
              H(tag(ci)) := H(tag(p)) | m - i + 1

where | is the append operation and i is a unique child number. The child block c of a
join vertex with parents p0 ... pm is assigned the English and Hebrew labels:

join:         E(tag(c)) := max(E(tag(pi)))
              H(tag(c)) := max(H(tag(pi)))

A block c created in a coordination operation that has parent p is assigned English and
Hebrew labels:

coordination: E(tag(c)) := E(tag(p)) | 1
              H(tag(c)) := H(tag(p)) | 1

As specified above, the length of the labels increases with the number of fork and
coordination operations. However, an additional heuristic described in [NR88a,NR88b] bounds
the label length to the level of nesting, O(N).

Two tags ti and tj are unordered if and only if the following condition is met:

(E(ti) < E(tj) and H(ti) > H(tj)) or (E(ti) > E(tj) and H(ti) < H(tj))

The tags of a block and any of its direct ancestors or descendants are always ordered.
However, English-Hebrew tags only encode concurrency properties for the fork and join
operations. When two blocks are not concurrent in the POEG because of explicit
coordination, their tags are nevertheless unordered. Coordination lists are used to record
execution orderings imposed by coordination.

Specifically, a coordination list is associated with each executing block b to store the
tags of the indirect ancestors of b. All tags in a coordination list are unordered, so that
the length C of a coordination list is bounded by O(T). The test for concurrency between
a block b and tag t requires determining if tag(b) or any of the tags in the coordination
list of b are ordered with t. Algorithm 4.1 shows the code for detecting if block b and tag
t are concurrent.

procedure IsConcurrent(b, t)
    if E(tag(b)) > E(t) and H(tag(b)) > H(t) then return false
    else
        for all tags c in coordination-list(b) do
            if E(c) > E(t) and H(c) > H(t) then return false
        endfor
    endif
    return true
end procedure

Algorithm 4.1: English-Hebrew Concurrency Check

Figure 4.3 illustrates the use of English-Hebrew labels for a POEG. The coordination
list of block 1131,1231 contains the tag 121,113 because 121,113 is an indirect ancestor
of 1131,1231. Likewise, the coordination list of block 121,113 contains the tag 113,123.
We can determine that blocks 1131,1231 and 11,12 are not concurrent because their tags
are ordered; namely, 1131 > 11 and 1231 > 12. Similarly, blocks 1131,1231 and 12,11
are not concurrent because a tag in the coordination list of 1131,1231 (namely, 121,113)
is ordered with 12,11. However, blocks 1131,1231 and 123,111 are concurrent because
neither the tag 1131,1231 nor any entry in its associated coordination list is ordered with
123,111.
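
The labeling rules and the concurrency test can be checked against these examples with
a short Python sketch. It is an illustration rather than the implementation measured in
this chapter: labels are represented as tuples of integers, so that Python's built-in tuple
comparison provides the lexicographic ordering, and the helper tag() assumes that every
label component is a single digit, as in the examples above. All function names are
hypothetical.

```python
def tag(text):
    """Parse a tag such as "1131,1231" into (English, Hebrew) tuples.

    Assumption: every label component is a single digit.
    """
    english, hebrew = text.split(",")
    return tuple(int(d) for d in english), tuple(int(d) for d in hebrew)


def fork_tags(parent, m):
    """Tags of the m children of a fork: E appends i, H appends m - i + 1."""
    e, h = parent
    return [(e + (i,), h + (m - i + 1,)) for i in range(1, m + 1)]


def join_tag(parents):
    """Tag of the child of a join: the maximum English and Hebrew labels."""
    return max(e for e, _ in parents), max(h for _, h in parents)


def coordination_tag(parent):
    """Tag of a block created by a coordination operation: both labels append 1."""
    e, h = parent
    return e + (1,), h + (1,)


def ordered_after(a, t):
    """True if tag a follows tag t in both the English and the Hebrew ordering."""
    return a[0] > t[0] and a[1] > t[1]


def is_concurrent(block_tag, coordination_list, t):
    """Algorithm 4.1: block is concurrent with t unless its own tag or a tag
    in its coordination list is ordered after t."""
    if ordered_after(block_tag, t):
        return False
    return not any(ordered_after(c, t) for c in coordination_list)
```

For block 1131,1231 with coordination list [121,113], the test returns false for tags
11,12 and 12,11 and true for tag 123,111, in agreement with the discussion of Figure 4.3.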

Coordination lists are similar to parent vectors in that it is necessary to keep
coordination lists only for the currently executing blocks. New coordination lists are formed by
merging the coordination lists of all parents and adding indirect parents. Comparing any
two English-Hebrew labels requires O(N) work. Because the number of English-Hebrew
tags is unbounded, the coordination list cannot be directly indexed as an array. Hence,
the cost of checking for a conflict at each variable access is bounded by O(N x C). This
problem can be addressed by maintaining the English-Hebrew labels in both the
concurrency lists and in the access histories in sorted order, so that standard binary search and
sort-merge techniques can be applied.

Figure 4.3: English-Hebrew Labeling

An implementation difficulty with English-Hebrew labeling is managing the tags. Since
English and Hebrew labels are of variable length, storage allocation becomes an issue. If
tags are stored directly in access histories, the amount of storage needed is significantly
increased, since the number of access histories is O(V), and it is not unreasonable to
assume V to be of the order of several million. If the tags are stored indirectly, there is a
garbage collection issue. In this implementation, English-Hebrew labels are stored
indirectly in access histories to decrease access history size. The generation count optimization
(described below) is used to perform garbage collection.

4.1.3      Task  Recycling 

The implementation of the task recycling technique uses modification sets to maintain
parent vectors and the simple MRU algorithm to perform task assignment. Because most
parallel scientific Fortran programs do not have a complex concurrency structure, the
simple MRU algorithm generally performs well for this class of programs. In fact, it is
guaranteed to be optimal for all of the benchmark programs.

In order to minimize contention in assigning and freeing tasks, several free task sets are
associated with each node in the free task dag. When a block needs to be assigned a task
from node n, one of the free task sets associated with n is selected based on a "random"
value (e.g. the processor id). If the number of free task sets associated with a node
is similar to the underlying parallelism, the contention for the free task dag is greatly
reduced, although more tasks may be needed. In our implementation, the Fetch&Add
operation is used to select a free task set, and the number of free task sets is equal to the
underlying parallelism of eight.
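
One way such a selection could work is sketched below in Python. The sketch is an
assumption-laden illustration, not the Ultracomputer code: the hardware Fetch&Add is
emulated with a lock-protected counter, the round-robin policy (counter value modulo
the number of sets) is assumed rather than taken from the implementation, and the class
and method names are hypothetical.

```python
import threading


class FetchAndAdd:
    """Software stand-in for the hardware Fetch&Add primitive."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, increment=1):
        # Atomically return the old value and add the increment.
        with self._lock:
            old = self._value
            self._value += increment
            return old


class FreeTaskNode:
    """One node of the free task dag, holding several free task sets."""

    def __init__(self, num_sets=8):
        self._sets = [[] for _ in range(num_sets)]
        self._locks = [threading.Lock() for _ in range(num_sets)]
        self._counter = FetchAndAdd()

    def _pick(self):
        # Assumed policy: spread requests round-robin over the sets.
        return self._counter.fetch_and_add() % len(self._sets)

    def free(self, task):
        k = self._pick()
        with self._locks[k]:
            self._sets[k].append(task)

    def assign(self, make_task):
        k = self._pick()
        with self._locks[k]:
            if self._sets[k]:
                return self._sets[k].pop()
        # The chosen set is empty even though another set may hold a free
        # task, which is why more tasks may be needed overall.
        return make_task()
```

Because each request touches only one of the sets, concurrent requests rarely contend
for the same lock, at the price of occasionally creating a task while a free one sits in
another set.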

4.1.4      Generation  Counts 

A generation count is a mechanism for grouping blocks that are potentially concurrent.
The generation is incremented every time the execution of the program is decreased to
a single thread or increased to more than one thread. Two blocks that are not in the
same generation are guaranteed not to be concurrent. However, two blocks in the same
generation are not necessarily concurrent. Further analysis is required to determine the
concurrency relationship between blocks in the same generation.

For instance, there are seven generations in the POEG in Figure 4.4. All blocks
in an outermost parallel construct are in the same generation. No block in generation 4 is
concurrent with any block in preceding generations 1, 2 or 3 or subsequent generations 5,
6 or 7. However, not all blocks in generation 4 are concurrent.

Generation counts are used by both task recycling and English-Hebrew labeling to
significantly reduce the space and computation time requirements. Since many scientific
Fortran programs consist of long series of generations, similar benefits may be achieved
for many programs in this class. (In contrast, many Ada programs are a single generation,
since concurrent tasks can be created when a program begins which terminate only when
the program completes. For this class of programs, the generation count will only increase
access history sizes without decreasing the overall computation time or space.)

To use generation counts, the current generation count is stored with each entry in
an access history. When a block checks an entry in an access history, it first checks the
generation count. If the generation count of the entry is less than the current generation
count, the block cannot be concurrent with the entry and therefore does not need to check
its parent vector (in the case of task recycling) or compare tags (in the case of
English-Hebrew labeling). A concurrency check is performed only if a variable is accessed by two
blocks in the same generation.
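
The generation test can be sketched in a few lines of Python. The sketch is illustrative
only: the starting generation value of 1, the (generation, block) layout of an access
history entry, and all names are assumptions, and the expensive concurrency check is
passed in as a function so that either technique could supply it.

```python
class GenerationCounter:
    """Tracks the current generation as the number of live threads changes."""

    def __init__(self):
        self.generation = 1     # assumed starting value
        self.threads = 1

    def set_thread_count(self, n):
        # Bump the generation when execution drops to a single thread
        # or grows from a single thread to more than one.
        if (self.threads == 1) != (n == 1):
            self.generation += 1
        self.threads = n


def conflicts(entry, current_generation, is_concurrent):
    """entry is (generation, block_id); is_concurrent is the expensive check."""
    entry_generation, block_id = entry
    if entry_generation < current_generation:
        return False            # earlier generation: cannot be concurrent
    return is_concurrent(block_id)
```

An entry from an earlier generation is dismissed by a single integer comparison, so the
parent vector lookup or tag comparison is reached only for entries recorded in the
current generation.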

For the English-Hebrew labeling technique, all tags associated with blocks in earlier
generations can be discarded when a new generation is entered. This method for
performing garbage collection can result in dramatic space savings. In addition, the number
of expensive tag comparisons is decreased. For task recycling, generation counts decrease
the amount of work performed maintaining parent vectors. In particular, parent vectors
need to accurately represent ancestor relationships only among blocks in the same
generation. Therefore, no parent vector maintenance is needed in programs with no nested
parallelism or coordination.

Figure 4.4: POEG with Generations
4.1.5       Parallel  Fortran  Environment

The "run-until-completed" scheduling paradigm is the method used in the Ultracomputer
parallel Fortran and in many other parallel Fortran environments. In the
run-until-completed model, a thread created in a doall will not block until it terminates at the
associated endall operation. Although parallel scientific codes typically exhibit a high
degree of nominal parallelism, at any point in time the number of threads created but not
terminated is limited to P, the underlying parallelism of the machine. Therefore, at most
P parent vectors or coordination lists are actually associated with currently executing
threads at any point during the execution.

In the parallel Fortran environment, each of P actual processes performs the work of
several parallel threads, and the differences among the private parent vectors of these
threads are very small. Task recycling takes advantage of this property to reduce the
work performed in initializing parent vectors in programs with strictly nested parallel
constructs (e.g. no nested series of parallelism).

A private stack of N "template" parent vectors is cached where N is the level of
parallelism nesting. After a block terminates, the modification set associated with its
current parent vector is used to update the new parent vector as described in Section 3.2.
The modification set is also used to remove all changes made to the old parent vector p,
thereby setting p back to its initial state. When the next parallel thread is executed, a
new parent vector does not have to be allocated and re-initialized. The number of times a
change is either added or backed out of a parent vector is equal to the nesting level of the
block that is associated with the change. Thus, the cost per block for maintaining parent
vectors is reduced from O(T) to O(N), and the amount of work per process is O(T x N).


4.2     Program  Behavior  Results 

The first goal in performing these experiments is to gain insight into the programming
style found in "real" parallel programs. In particular, we were interested in gathering
data about their concurrency structure and shared variable access patterns. Information
about the concurrency structure allows us to estimate average values for the maximum
concurrency, T, and total number of blocks, B. It also indicates the viability of various
optimization techniques and the expected benefit of the modified MRU task assignment
algorithm. Shared variable access patterns influence the size and work required for
maintaining access histories. Our findings on parallel program behavior are presented
below.
4.2.1      Concurrency  Structure

The concurrency structure of all four benchmark parallel programs is quite simple: there
is very limited nesting of doall constructs and minimal synchronization. This means that
the optimizations described in Sections 3.2 and 4.1 significantly improve the performance
of task recycling and English-Hebrew labeling. In addition, the simple MRU algorithm is
guaranteed to perform an optimal task assignment for all of the benchmark programs.

However, the degree and granularity of parallelism vary considerably among the
benchmark programs.

•  Triso has coarse granularity parallelism with limited synchronization. It consists of
a single doall operation which creates 8 parallel threads; these threads subsequently
perform two barrier synchronization operations.

•  Simple has medium granularity parallelism with some coordination. It performs
10 doall operations that each create 124 parallel threads, and 130 doall operations
that create from 10 to 30 threads. In addition, during 10 phases of execution
approximately 15 concurrent threads execute a critical section.

•  Finite exhibits a large degree of fine granularity parallelism and does not perform
any coordination. It performs 60 doall operations that each create 1000 parallel
threads, 50 doall operations that create 250 threads, and 200 doall operations that
create between 2 and 32 threads. Each block performs a very limited amount of
computation; in many cases, a block consists of a single operation on an array
element.

•  Polymer exhibits a large degree of medium granularity parallelism and does not
coordinate. It has one level of nested parallelism; the first three benchmark programs
do not have nested parallelism. It performs 40 nested doall operations; the outer
operations create 1000 parallel threads each of which creates 3 parallel sub-threads.
In addition, it performs 20 doall operations which create 350 parallel threads and
10 doall operations which create 100 parallel threads.

Table 4.1 summarizes the concurrency parameters for the benchmark programs.

The experimental results presented in Section 4.3 show that the concurrency structure
of the program has a significant impact on the cost of maintaining concurrency information.
For instance, the unoptimized version of task recycling performs very poorly on
programs with fine granularity parallelism, such as Finite. Surprisingly, the concurrency
structure does not significantly impact the cost per variable access, as is discussed in the
following section.

Program     Total # of Blocks     Maximum Concurrency
Triso                      24                       8
Simple                  4,090                     124
Finite                 74,900                   1,000
Polymer               128,000                   3,000

Table 4.1: Concurrency Structure

4.2.2     Shared Variable Access Patterns

The work and space required for maintaining access histories are proportional to the average
size of the reader set. While theoretically the size of the reader set may grow to the
maximum concurrency of the POEG, in practice the number of concurrent readers is much
smaller. This is due to a common technique for designing parallel scientific programs.
Many parallel scientific programs distribute the workload by partitioning data among
concurrent threads. Hence a thread often shares data with one or two neighboring threads,
but seldom shares data with all other threads. Thus, one would expect the number of
concurrent readers to be limited in programs based on this "data partitioning" paradigm.

The shared memory access patterns measured for the benchmark programs support
this conclusion and are shown in Figure 4.5. The number of concurrent readers goes from
1 to 9 along the x-axis, and the percentage of accessed shared variables that are read by
at most R concurrent blocks goes from 45% to 100% along the y-axis. (Shared variables
that are never accessed are not included in these statistics.) For example, 95% of the
shared variables in the Triso program are never read by more than two concurrent readers
at any point during execution.

The most significant result of these measurements is that the number of concurrent
readers tends to be very small. In all four programs, more than 80% of the variables are
never read by more than two concurrent readers and almost 50% are never read concurrently.
In addition, there appears to be little correlation between the number of concurrent
readers and the degree of parallelism of the program. For example, the Triso and Polymer
programs have fairly similar access patterns for the majority of their variables (excepting
the 9.3% in the Polymer program with more than 9 concurrent readers). However, Triso
has modest parallelism while Polymer has nested parallelism of a very high degree.

Table 4.2 shows estimates of the average reader set size¹. In fact, these figures are


¹Any variable with more than 9 concurrent readers was assumed to have the maximum number of
concurrent readers. For the first three benchmark programs, this does not affect the estimated average
reader set size. However, 10% of the variables in the Polymer program have more than 9 concurrent
readers. In this case we are slightly less conservative and assume that these variables are accessed by the
average number of concurrent blocks.

Figure 4.5: Number of Concurrent Readers

pessimistic with respect to the average reader set size, since they assume that a variable
that is accessed by at most n concurrent blocks at some point during the execution
always has n entries in its reader set. A variable with a maximum number of concurrent
readers of n may actually have a much smaller reader set size throughout most of the
execution of the program.

A direct consequence of the limited reader set sizes is that the estimated number of
shared variables that are read per block is much smaller than the total number of variables.
(If a program has $V$ shared variables, average concurrency of $T_{ave}$, and average reader set
size of $R$, the average number of variables read per block is computed as $V \cdot R / T_{ave}$.) Since the
average number of variables read by a block, as well as the average number of concurrent
blocks that read a given shared variable, is very small, the access history based algorithms
are preferable to the approach used by the Merge algorithm [Sch89,Sni88] in which the
storage per variable depends only on the maximum concurrency of the POEG and not on
the number of accessing blocks.

Program     Ave. Reader Set Size     % Vars Read Per Block
Triso                       1.25                       23%
Simple                      1.69                        7%
Finite                      2.71                        1%
Polymer                      168                        9%

Table 4.2: Estimated Average Reader Set Sizes and Variables Accessed Per Block
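
The per-block estimate can be worked through numerically. The sketch below assumes the estimate is the product of the number of shared variables and the average reader set size, divided by the average concurrency; the function names and the sample inputs are hypothetical and are not measurements from the benchmark programs.

```python
def vars_read_per_block(num_vars: float, avg_reader_set: float,
                        avg_concurrency: float) -> float:
    """Estimated average number of shared variables read by one block.

    Assumed form of the estimate: V * R / T_ave, i.e. V variables, each
    read by R concurrent blocks on average, spread over T_ave blocks.
    """
    if avg_concurrency <= 0:
        raise ValueError("average concurrency must be positive")
    return num_vars * avg_reader_set / avg_concurrency


def fraction_read_per_block(avg_reader_set: float,
                            avg_concurrency: float) -> float:
    """The same estimate as a fraction of all shared variables (R / T_ave)."""
    return vars_read_per_block(1.0, avg_reader_set, avg_concurrency)


# Hypothetical inputs: 1000 shared variables, an average reader set size
# of 2, and an average of 50 concurrent blocks.
example_count = vars_read_per_block(1000, 2.0, 50.0)
example_fraction = fraction_read_per_block(2.0, 50.0)
```

Under this reading, the fraction of variables read per block depends only on the ratio of reader set size to concurrency, which is why small reader sets keep the per-block figure low even for highly parallel programs.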


4.3     Monitoring  Results 


The second goal in performing these benchmark measurements is to understand the actual
performance impact of on-the-fly access anomaly detection on parallel programs. To this
end, the elapsed running time, CPU time and space requirements for the task recycling
technique were measured and compared with the costs incurred by English-Hebrew labeling.
The primary difference between the two algorithms is that the cost per variable
access is clearly smaller in task recycling than in English-Hebrew labeling. Because variable
access is frequently the most common operation, this is an important attribute of
task recycling. However, concurrency information management is potentially much less
expensive for English-Hebrew labeling. The benchmark results provide some insights into
the significance of the tradeoffs between the two algorithms. These measurements also
enable us to evaluate the effectiveness of various optimization strategies.

Four versions of each program were executed:

Unmonitor  -  unmonitored  program 

Concurrency - maintains concurrency information but accesses to shared
variables are not monitored

Monitor(1) - monitors every shared variable using a reader set size of one

Monitor(2)  -  monitors  every  shared  variable  using  a  reader  set  size  of  two 

These versions isolate the costs of maintaining concurrency information and updating
access histories.

4.3.1      Space  Requirements 

Table 4.3 compares the storage requirements for maintaining concurrency information in
task recycling and English-Hebrew labeling. As is shown in Table 4.3, English-Hebrew
labeling requires substantially more space than task recycling when the generation count
optimization is not used. For example, almost a megabyte of memory is required to
monitor the Finite program with the English-Hebrew labeling technique whereas the task
recycling technique uses only 29 kilobytes. In fact, English-Hebrew labeling could not
monitor the Polymer program due to memory constraints on the Ultracomputer.

                 Task          English-Hebrew
Program       Recycling       Unopt        Opt
Triso                 2           8          2
Simple                6          80          7
Finite               29         954         20
Polymer             279      7,000†        239

†Estimated value

Table 4.3: Concurrency Information Space Requirements (in Kbytes)

When generation counts are used, however, task recycling and English-Hebrew labeling
require approximately the same amount of space for maintaining concurrency information.
(All English-Hebrew labels are discarded at the end of every generation.) Therefore, the
generation count optimization is very important since it broadens the class of programs
supported by the English-Hebrew labeling technique. For programs with infrequent serialization
points, the excessive space requirements of English-Hebrew labeling may make
its use infeasible.

4.3.2      CPU  Times 

The user mode CPU time of a parallel program P is the sum of the execution times of all
of the threads in P, and it reflects the total amount of work performed by P. Figure 4.6
shows the user mode CPU time for executing the Concurrency version of each benchmark
program with both the optimized and unoptimized versions of the anomaly detection
algorithms. (Concurrency isolates the cost of maintaining the concurrency information
from the overall cost of detecting access anomalies.) The times are shown as a percentage
of the CPU time for the Unmonitor version, and are computed as:

\[ \frac{\mathit{Concurrency}}{\mathit{Unmonitored}} \times 100 \]

For example, there is a 175% increase in CPU time to maintain concurrency information
for the Polymer program using the unoptimized version of the task recycling technique.

As can be seen, the cost of maintaining concurrency information can be substantial for
programs with very high degrees of fine-grained parallelism. For example, the overhead
incurred by unoptimized task recycling is twice as much as English-Hebrew labeling for
Simple and Finite, and up to six times as much work is performed as in executing the original
versions of the programs. (The unoptimized version of English-Hebrew labeling could not
be used to monitor Polymer due to memory limitations.)

One can also compare the cost of maintaining concurrency information both with and
without optimizations in place. When the generation count optimization is used, task
recycling performs as well as or better than English-Hebrew labeling for all programs. Thus,
the generation count optimization leads to a decrease in the computation cost for task
recycling that is as striking as the decrease in space for English-Hebrew labeling.

Figure 4.7 displays the increase in user mode CPU time when monitoring access to
shared variables. In particular, the Concurrency, Monitor(1) and Monitor(2) versions
of each program were executed using the optimized anomaly detection algorithms. The
most important result shown in Figure 4.7 is that English-Hebrew labeling requires more
time than task recycling for monitoring variables, even when the cost of maintaining
concurrency information is comparable (for example, Polymer).

A comparison of the execution times for Monitor(1) and Monitor(2) in Figure 4.7
shows that reader set sizes can be increased from one entry to two entries with little
additional cost for task recycling (less than 30% for all programs). The percentage of
variables guaranteed to have at least one anomaly detected is increased by 60% (from
50% to 80%). As reader sets grow in size, the cost of the concurrency check becomes
more important. Therefore, the overhead grows more rapidly for English-Hebrew labeling
than for task recycling.

4.3.3     Elapsed  Running  Times 

An alternative metric of the computation overhead of executing a parallel program is the
elapsed running time of the execution. This reflects the overall cost for executing the
program in a stand-alone environment and perhaps is the more interesting measurement
for users of an anomaly detection system. Figure 4.8 shows the increase in total elapsed
times  for  the  Concurrency  and  Monitor(2)  versions  of  each  of  the  benchmark  programs 
using  the  optimized  anomaly  detection  algorithms. 

Maintaining concurrency information results in a 2% to 78% increase in elapsed running
time, while monitoring all accesses to all shared variables with reader sets of size two incurs
a 140% to 350% increase in elapsed times.

The difference between the elapsed and CPU time increases stems from Amdahl's law
[Amd67]. Almost all of the overhead of access anomaly detection is performed in parallel,
and most of the benchmark programs contain a non-trivial amount of serial execution.
The overhead of both maintaining concurrency information and monitoring accesses to
shared variables could be greatly reduced by using static analysis to minimize the amount
of monitoring actually performed. Even if this preprocessing is not performed, the increase
in elapsed time is small enough to allow on-the-fly access anomaly detection to be a viable
debugging tool.
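
The Amdahl's law argument can be made concrete with a simple cost model. The sketch below is an illustration only: it assumes a program with a fixed amount of serial time and an amount of parallel work divided evenly among processors, and it assumes that monitoring inflates only the parallel work by a constant factor. All names and numbers are hypothetical and are not taken from the benchmark measurements.

```python
def cpu_time(serial: float, parallel_work: float, overhead: float) -> float:
    """Total work: serial part plus parallel work inflated by the overhead."""
    return serial + parallel_work * (1.0 + overhead)


def elapsed_time(serial: float, parallel_work: float, overhead: float,
                 processors: int) -> float:
    """Wall-clock time when the parallel work is split over the processors."""
    return serial + parallel_work * (1.0 + overhead) / processors


def percent_increase(base: float, monitored: float) -> float:
    """Increase of the monitored run over the unmonitored run, in percent."""
    return (monitored - base) / base * 100.0


# Hypothetical program: 10 units of serial time, 90 units of parallel work,
# 9 processors, and monitoring that doubles the parallel work (overhead 1.0).
SERIAL, PARALLEL, PROCS, OVERHEAD = 10.0, 90.0, 9, 1.0

cpu_increase = percent_increase(
    cpu_time(SERIAL, PARALLEL, 0.0),
    cpu_time(SERIAL, PARALLEL, OVERHEAD))
elapsed_increase = percent_increase(
    elapsed_time(SERIAL, PARALLEL, 0.0, PROCS),
    elapsed_time(SERIAL, PARALLEL, OVERHEAD, PROCS))
```

In this model the serial portion weighs more heavily in the elapsed time than in the CPU time, so the same monitoring overhead shows up as a smaller relative increase in elapsed time.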

Figure 4.6: Percentage Increase in CPU Time for Maintaining Concurrency Information

Figure 4.7: Percentage Increase in CPU Time for Monitoring Accesses

Chapter 5

Critical  Section  Coordination  and 
Nondeterminism 


The standard representation of critical section coordination, described in Section 2.3, adds
coordination edges to the POEG to explicitly model the execution order of critical section
instances. While this ordered representation is correct, anomaly detection is computationally
expensive. Every execution order of the critical sections must be analyzed, since
anomalies can be masked by the coordination edges used to model a given execution order.

In many cases, however, critical section coordination is "well-behaved" in the sense
that the execution order of critical sections does not affect the program execution. Consider
the program in Figure 5.1. The critical sections in this program compute the value
f(1) + f(2) + ... + f(n). This sum is independent of the order in which the terms are
added. Since only the final result is used, the outcome of the program is unaffected by
the order in which the critical sections execute. In this chapter we propose an alternative
unordered representation for critical section coordination that does not model the critical
section execution order in the POEG and describe criteria for determining when this new
representation is reliable.

    X := 0
    doall i := 1 to n
        j := f(i)
        lock(L)
        X := X + j
        unlock(L)
    endall
    T := X

Figure 5.1: Order Independent Program
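
The order independence of the program in Figure 5.1 can be observed with a small runnable sketch. The version below uses Python threads and a lock in place of the doall construct and the critical section; the function f, the thread start order, and the helper name run_doall are assumptions made for illustration.

```python
import random
import threading


def f(i: int) -> int:
    """Hypothetical per-iterate computation; any pure function would do."""
    return i * i


def run_doall(n: int, seed: int) -> int:
    """Run iterates 1..n as threads that add f(i) to a shared sum X."""
    x = 0
    lock = threading.Lock()

    def body(i: int) -> None:
        nonlocal x
        j = f(i)            # computed outside the critical section
        with lock:          # lock(L) ... unlock(L)
            x = x + j

    # Start the iterates in a seed-dependent order to vary the order in
    # which the critical sections are likely to execute.
    order = list(range(1, n + 1))
    random.Random(seed).shuffle(order)
    threads = [threading.Thread(target=body, args=(i,)) for i in order]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x                # T := X, read after all iterates finish


results = {run_doall(100, seed) for seed in range(5)}
```

Every run yields the same final value because addition is commutative and associative and only the final sum is read. A program that read X inside or between critical sections would not have this property.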


The efficiency and effectiveness of dynamic access anomaly detection are greatly improved
when the unordered representation is used.

1. Less ordering is added to the POEG so that more anomalies are exposed in a given
execution instance.

2. Theorem 5 applies when a program P contains no other unmatched coordination.
Thus, a single execution instance is sufficient to prove that P is anomaly free for an
input vector I. (If the ordered representation is used, N! execution instances must
be analyzed where N is the number of concurrent critical sections.)

In addition, critical sections that are reliably modeled by the unordered representation
can be ignored in program analysis and debugging. For instance, these critical sections do
not have to be traced by trace-and-replay debuggers, since their execution order does not
affect subsequent execution. Section 5.1 describes the unordered representation of critical
section coordination.

Unfortunately, the unordered representation does not always reliably model the interaction
between the critical sections and the remainder of the program. Section 5.2
addresses the problem of determining when a program's execution is dependent on the
execution order of critical sections. Specifically, the notion of nondeterminism stemming
from the execution order of critical sections is refined and three independent types of
nondeterminism are identified: parallel, reference and sequential nondeterminism. The unordered
representation is guaranteed to be reliable if a set of critical sections is parallel, sequential
and reference deterministic.

Section 5.3 presents compiler-based static analysis algorithms for detecting these three
types of nondeterminism. (Besides determining the reliability of the unordered representation,
isolating any type of nondeterminism is useful for understanding the logic of a parallel
program.) Parallel, reference and sequential nondeterminism can be detected relatively
accurately. Only information about variables accessed within critical sections is needed,
and critical sections are generally small sections of code that access a largely disjoint set of
variables. In contrast, statically determining that a set of concurrent threads contains no
access anomalies is relatively difficult, since this requires obtaining accurate information
for every variable that is read and written in a concurrent thread. Thus, it may be
possible to prove at compile time that a given program is parallel, reference and sequential
deterministic, but not to show that it is anomaly free.

Figure 5.2 presents a block diagram of a general nondeterminism analysis system. The
output from the static analysis phase (comprising parallel, sequential, and reference
nondeterminism detection and static access anomaly detection) is used by the dynamic access
anomaly detection system to decide: (i) which representation to use for critical section
coordination, and (ii) which accesses are potential anomalies and must be monitored.

Figure 5.2: Nondeterminism Detection System

Heuristics can make access anomaly detection tractable even in the presence of
nondeterminism. For instance, a practical dynamic access anomaly detection system will
generally not attempt to prove that a program and input vector pair is anomaly free,
since this is often intractable. When the only correctness requirement of a dynamic
anomaly detection system is that no false anomalies are reported, certain types of
nondeterminism can be ignored. Section 5.4 presents several alternative representations of
critical section coordination that make dynamic anomaly detection more effective through
a better classification and semantic understanding of the critical section coordination in
a program.

5.1      Unordered  Critical  Sections  Representation 

The  ordered  representation  for  critical  sections,  presented  in  Section  2.3,  models  the 
execution  order  of  the  critical  sections  explicitly  by  adding  coordination  edges  to  the 
POEG. The coordination edges from unlock operations to lock operations represent an
arbitrary ordering of events, not an ordering that is intended or necessary. Thus, the
ordered representation does not accurately reflect the semantics of critical sections. Often,
more ordering is added than intended by the programmer, making anomaly detection more
expensive than necessary.

The unordered representation is an alternative way of modeling critical section coordination.
The unordered representation ignores the execution order of critical sections
so that no coordination edges are added to the POEG. The benefit of the unordered
representation is that it compresses many POEGs into a single POEG, thereby exposing
anomalies that otherwise can be detected only by examining many execution instances.
Moreover, any critical section that can be reliably modeled by the unordered representation
is matched. (Refer to Section 2.4 for the definition of matched.) Since Theorem 5
applies to this class of critical sections, a single execution instance is sufficient to prove
that a program with unordered critical sections and no other nondeterminism is anomaly
free for a given input vector.

Because no coordination edges are added to the POEG, an alternative method must
be used to ensure that accesses made within concurrent critical sections are not incorrectly
reported as access anomalies. It is insufficient simply to not monitor these accesses, since
anomalies must be reliably reported when accesses inside and outside critical sections
conflict. Specifically, read events performed within a critical section conflict with simple
write events (i.e., write events performed outside a critical section), and a write performed
within a critical section conflicts with simple read and write events. Accesses within
critical sections can be correctly monitored through the use of lock covers¹.

By the semantics of critical sections, every access in a critical section is protected by
the lock associated with the critical section. A lock cover is associated with each access
and contains all of the locks that are held when the access is performed. An access is
covered by more than one lock when critical sections are nested; an access has an empty
lock cover if it is not performed in a critical section.

Lock covers are used in detecting conflicts between accesses. Accesses that are covered
by the same lock are guaranteed not to execute at the same time, regardless of the concurrency
relationships of the blocks performing the accesses, because of the semantics of critical
sections. Therefore, any two accesses that have some lock in common never conflict. Two
accesses that do not have a lock in common are treated as simple read and write events.
This representation is robust in that it allows for the detection of accesses that the
programmer neglected to lock or inadvertently protected with an incorrect lock.
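
The lock cover rule can be sketched in a few lines. In the sketch below the Access record, its field names, and the conflicts function are hypothetical names chosen for illustration; whether two accesses are concurrent is assumed to be supplied by the concurrency information maintained elsewhere.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Access:
    block: str                            # block performing the access
    is_write: bool                        # True for a write, False for a read
    locks: FrozenSet[str] = frozenset()   # lock cover; empty outside critical sections


def conflicts(a: Access, b: Access, concurrent: bool) -> bool:
    """Two accesses to the same variable conflict when the blocks are
    concurrent, at least one access is a write, and no lock covers both."""
    if not concurrent:
        return False
    if not (a.is_write or b.is_write):
        return False
    return a.locks.isdisjoint(b.locks)


L = frozenset({"L"})
M = frozenset({"M"})

same_lock = conflicts(Access("b1", True, L), Access("b2", True, L), True)
unlocked_read = conflicts(Access("b1", True, L), Access("b2", False), True)
wrong_lock = conflicts(Access("b1", True, L), Access("b2", True, M), True)
```

The last two cases show the robustness property: an access that was left unprotected, or protected by a different lock, has a cover disjoint from the other access and is therefore treated like a simple access.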


¹Lock covers are similar to the notion of flavors proposed by Callahan and Kennedy for the static
analysis of critical sections [CK87].

Bernstein's conditions can be modified to reflect the semantics of lock covers. Let
$\mathit{Write}_L$ denote the set of write events with lock cover $L$ and $\mathit{Write}_{\overline{L}}$ denote the set of
write events that are not covered by any lock in $L$. The set of simple write operations is
$\mathit{Write}_L$ where $L = \emptyset$. (In this case, $\overline{L}$ contains all locks.) $\mathit{Read}_L$ and $\mathit{Read}_{\overline{L}}$ are defined
similarly. The conditions for detecting access anomalies between two concurrent blocks
$S_i$ and $S_j$ are as follows:

Modified Bernstein's Conditions:

\[
\begin{aligned}
\mathit{Read}_L(S_i) \cap \mathit{Write}_{\overline{L}}(S_j) &= \emptyset \\
\mathit{Write}_L(S_i) \cap \mathit{Read}_{\overline{L}}(S_j) &= \emptyset \\
\mathit{Write}_L(S_i) \cap \mathit{Write}_{\overline{L}}(S_j) &= \emptyset
\end{aligned}
\]

The execution order of $S_i$ and $S_j$ does not introduce any access anomalies if and only if
the above three conditions are met for every lock cover $L$.
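
One way to read the modified conditions is as set operations over the variables each block accesses, tagged with the lock cover of each access. The sketch below follows that reading, which is an assumption about the intended meaning of the sets; the tuple representation and all function names are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# An access is a (variable name, lock cover) pair.
Access = Tuple[str, FrozenSet[str]]


def with_cover(accesses: Iterable[Access], cover: FrozenSet[str]) -> Set[str]:
    """Variables accessed with exactly the lock cover `cover`."""
    return {var for var, locks in accesses if locks == cover}


def outside_cover(accesses: Iterable[Access], cover: FrozenSet[str]) -> Set[str]:
    """Variables accessed while holding no lock that appears in `cover`."""
    return {var for var, locks in accesses if locks.isdisjoint(cover)}


def satisfies_conditions(reads_i, writes_i, reads_j, writes_j) -> bool:
    """Check the three conditions for every lock cover used by block S_i.

    Covers that S_i never uses give empty left-hand sets, so they
    satisfy the conditions trivially and need not be enumerated.
    """
    covers = {locks for _, locks in list(reads_i) + list(writes_i)}
    for cover in covers:
        read_i = with_cover(reads_i, cover)
        write_i = with_cover(writes_i, cover)
        if read_i & outside_cover(writes_j, cover):
            return False
        if write_i & outside_cover(reads_j, cover):
            return False
        if write_i & outside_cover(writes_j, cover):
            return False
    return True


L = frozenset({"L"})
NONE: FrozenSet[str] = frozenset()

# Both blocks read and write X only inside the critical section on L.
both_locked = satisfies_conditions([("X", L)], [("X", L)],
                                   [("X", L)], [("X", L)])
# The second block also reads X outside the critical section.
stray_read = satisfies_conditions([("X", L)], [("X", L)],
                                  [("X", L), ("X", NONE)], [("X", L)])
```

With an empty cover, every access of the other block counts as outside the cover, so the conditions reduce to the original Bernstein conditions for simple accesses.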

To illustrate the use of lock covers, consider the POEG in Figure 5.3 that models an
execution instance of a modified version of the program fragment in Figure 2.8. (Accesses
with lock cover {L} are denoted by ":=_L" in the POEG.) In this execution instance,
iterates i1, i2 and i3 enter the critical section in that order. The accesses of X inside
the critical section are not reported as access anomalies even though they are concurrent,
since all events have the same lock cover, {L}, and therefore do not conflict. However,
the read of X outside the critical section by i3 is correctly detected as an access anomaly
with respect to the critical section write operations. Moreover, the unsafe accesses to A[1]
are detected in every execution instance, regardless of the order in which iterates i1, i2
and i3 enter the critical section.

    A[1] := ...         A[2] := ...         A[3] := ...
    j :=_L X            j :=_L X            j :=_L X
    X :=_L i1           X :=_L i2           X :=_L i3
                                            A[3] := X
    ... := A[1]         ... := A[1]         ... := A[1]

    doall i := 1,3
        A[i] := ...
        lock(L)
        j := X
        X := i
        unlock(L)
        ... := A[1]
    endall

Figure 5.3: Unordered Representation of Critical Section Coordination

5.1.1      Access  Histories  and  Lock  Covers 

The algorithms for detecting access anomalies and updating the access histories must be
extended to consider lock covers. Each entry in an access history contains information
about the lock cover associated with the access. Since most events are not performed
within critical sections, each access history has distinct CS-reader and CS-writer sets
that record read and write events with non-empty lock covers. Distinguishing between
simple and critical section read and write events minimizes the number of set operations
performed when checking and updating an access history. It also saves space, since an
empty lock cover is not stored with simple read and write events.

When a block executes a read event in a critical section, the Check-CS-Read algorithm,
shown in Algorithm 5.1, checks for access anomalies. In particular, a critical section read
event e conflicts with all events in the writer set that are concurrent with e and all events
w in the CS-writer set that are concurrent with e such that there is no lock that covers
both e and w. A simple read operation e conflicts with all events in the writer and
CS-writer sets that are concurrent with e, as shown in algorithm Check-Read. A similar pair
of algorithms, Check-CS-Write and Check-Write, is used to check critical section write and
simple write events.
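
The checks described above can be sketched as follows. This is an illustrative rendering rather than the thesis implementation: the Event and AccessHistory classes are hypothetical, the concurrency test is passed in as a function, and the simple and critical section variants are folded into one routine by giving simple events an empty lock cover, which is disjoint from every other cover.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Event:
    block: str                            # block that performed the access
    locks: FrozenSet[str] = frozenset()   # lock cover; empty for simple events


@dataclass
class AccessHistory:
    writer: Optional[Event] = None                          # simple writer
    readers: List[Event] = field(default_factory=list)      # simple readers
    cs_writers: List[Event] = field(default_factory=list)   # CS-writer set
    cs_readers: List[Event] = field(default_factory=list)   # CS-reader set


Concurrent = Callable[[Event, Event], bool]


def _conflicting(e: Event, candidates: List[Event],
                 is_concurrent: Concurrent) -> List[Event]:
    """Candidates concurrent with e that share no lock with e."""
    return [a for a in candidates
            if is_concurrent(e, a) and e.locks.isdisjoint(a.locks)]


def check_read(e: Event, h: AccessHistory,
               is_concurrent: Concurrent) -> List[Event]:
    """Covers Check-Read and Check-CS-Read: compare against all writers."""
    simple = [h.writer] if h.writer is not None else []
    return _conflicting(e, simple + h.cs_writers, is_concurrent)


def check_write(e: Event, h: AccessHistory,
                is_concurrent: Concurrent) -> List[Event]:
    """Covers Check-Write and Check-CS-Write: compare against all entries."""
    simple = ([h.writer] if h.writer is not None else []) + h.readers
    return _conflicting(e, simple + h.cs_writers + h.cs_readers, is_concurrent)


# Hypothetical history for X after two iterates wrote X under lock L.
L = frozenset({"L"})
history = AccessHistory(cs_writers=[Event("i1", L), Event("i2", L)])


def all_concurrent(a: Event, b: Event) -> bool:
    """Toy concurrency relation: distinct blocks are concurrent."""
    return a.block != b.block


inside = check_read(Event("i3", L), history, all_concurrent)
outside = check_read(Event("i3"), history, all_concurrent)
```

A read of X by a third iterate inside the critical section reports nothing, while the same read outside the critical section is reported against both critical section writes.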

Lock covers complicate the subtraction optimization for access histories. In particular,
it is no longer the case that every future event that conflicts with an ancestor of the current
event e also conflicts with e. To illustrate this, consider the POEG in Figure 5.4. Suppose
that blocks b1, b2, b3 and b4 write X in that order and the access history for X is initially
empty. After block b1 writes X, an event with lock cover {A} is added to CS-Writer(X).
Suppose when block b2 writes X, the event in CS-Writer(X) is overwritten with lock
cover {A,B}. When block b3 writes X, the conflict with the write of X in b1 will not
be detected. Alternatively, suppose after block b2 writes X, the event in CS-Writer(X)
is overwritten with the intersection of the lock covers of b1 and b2, namely {A}. When
block b4 writes X, it will be incorrectly detected as conflicting with the write by b2.

Instead, an event a in the access history can be subtracted by the current access e
when either: (1) a conflicts with e, or (2) a was performed by an ancestor of e and
$\mathit{Locks}(e) \subseteq \mathit{Locks}(a)$. In both cases, an entry a is deleted only if any future access that
conflicts with a also conflicts with the current access.
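
The subtraction rule can be expressed as a single predicate. The sketch below is illustrative: the function names are hypothetical, and the lock cover {B} given to the third block is an assumption made for the example, since the covers of the later blocks are not stated in the text.

```python
from typing import FrozenSet


def can_subtract(a_locks: FrozenSet[str], e_locks: FrozenSet[str],
                 a_is_ancestor_of_e: bool, a_conflicts_with_e: bool) -> bool:
    """Entry a may be replaced by the current access e when (1) a conflicts
    with e, or (2) a is an ancestor of e and Locks(e) is a subset of Locks(a)."""
    return a_conflicts_with_e or (a_is_ancestor_of_e and e_locks <= a_locks)


def covers_conflict(x: FrozenSet[str], y: FrozenSet[str]) -> bool:
    """Two concurrent writes conflict when no lock covers both."""
    return x.isdisjoint(y)


A = frozenset({"A"})
AB = frozenset({"A", "B"})
B = frozenset({"B"})

# b1 wrote X with cover {A}; its descendant b2 writes X with cover {A, B}.
# The two writes are ordered, so they do not conflict.
b2_may_replace_b1 = can_subtract(A, AB, a_is_ancestor_of_e=True,
                                 a_conflicts_with_e=False)

# Why the replacement would be unsafe: a later concurrent write holding
# only B (assumed cover) conflicts with b1 but not with b2.
later_conflicts_with_b1 = covers_conflict(B, A)
later_conflicts_with_b2 = covers_conflict(B, AB)

# The reverse situation is safe: an ancestor with cover {A, B} can be
# subtracted by a descendant holding only {A}.
reverse_is_safe = can_subtract(AB, A, a_is_ancestor_of_e=True,
                               a_conflicts_with_e=False)
```

Because the descendant holds more locks than its ancestor, fewer future accesses conflict with it, so keeping only the descendant would hide the conflict with the ancestor. The subset test rules out exactly this case.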

After a block performs a read event in a critical section, it updates the access history
using the Subtract-CS-Read algorithm shown in Algorithm 5.2. It subtracts from
CS-Reader(X) every ancestor whose lock cover contains all of the locks in its own lock cover.
Simple read events subtract all ancestors from the reader and CS-reader sets, as shown in
Subtract-Read. Similar algorithms are performed for subtracting access histories for critical section
write and simple write events.

    procedure Check-CS-Read(b, X)
        if IsConcurrent(b, Writer(X)) then report Access Anomaly
        for all a in CS-Writer(X) do
            if Locks(b) ∩ Locks(a) = ∅ and IsConcurrent(b, a) then
                report Access Anomaly
        endfor
    end procedure

    procedure Check-Read(b, X)
        for all a in CS-Writer(X) ∪ Writer(X) do
            if IsConcurrent(b, a) then report Access Anomaly
        endfor
    end procedure

    procedure Check-CS-Write(b, X)
        for all a in Reader(X) ∪ Writer(X) do
            if IsConcurrent(b, a) then report Access Anomaly
        endfor
        for all a in CS-Writer(X) ∪ CS-Reader(X) do
            if Locks(b) ∩ Locks(a) = ∅ and IsConcurrent(b, a) then
                report Access Anomaly
        endfor
    end procedure

    procedure Check-Write(b, X)
        for all a in Reader(X) ∪ CS-Writer(X) ∪ CS-Reader(X) ∪ Writer(X) do
            if IsConcurrent(b, a) then report Access Anomaly
        endfor
    end procedure

Algorithm 5.1: Check for Read and Write Events with Lock Covers

A consequence of the unordered representation is that the size of access histories will
tend to be larger than when the ordered representation is used. First, CS-reader sets are
likely to contain more entries than reader sets since the programming paradigms that limit
reader set sizes do not apply to accesses within critical sections. A variable is generally
protected by a critical section because it is accessed by many concurrent threads. In the
ordered representation, these accesses would be ordered and therefore subtracted from
the access history.

Second, there can be nonconcurrent events in a CS-reader or CS-writer set. The size
of the CS-reader and CS-writer sets is a function of the size of the power set of the set of locks.
(The number of events in an access history associated with any given
lock cover is still bounded by the maximum concurrency of the POEG for CS-reader sets
and by one for CS-writer sets.) It is possible for programs to have many locks; consider a
lock array that protects individual rows of a matrix. However, it is reasonable to expect
that a small number of different lock covers are associated with any given variable, since
complex locking patterns can lead to deadlocks and introduce excessive serialization.

Figure 5.4: Subtraction Example

5.2     Unordered  vs.  Ordered  Representation 

Although the unordered representation reliably models certain critical sections, it can fail
to meet both Reliability Properties 1 and 2. When the unordered representation does
not correctly model the ordering intended by the programmer, false anomalies can be
reported. Consider the program in Figure 5.5, in which the critical sections pass access
control to entries of array A. There are no access anomalies in this program. However,
in those execution instances in which iterate k executes the critical section immediately
before iterate m, the write of A[k] by iterate k and the read of A[k] by iterate m will be
falsely identified as an access anomaly if the unordered representation is used.

procedure Subtract-CS-Read(b, X)
  for all a in CS-Reader(X) do
    if Locks(b) ⊆ Locks(a) and not IsConcurrent(b, a) then delete a from CS-Reader(X)
  endfor
  add b to CS-Reader(X)
end procedure

procedure Subtract-Read(b, X)
  for all a in Reader(X) ∪ CS-Reader(X) do
    if not IsConcurrent(b, a) then delete a from Reader(X) or CS-Reader(X)
  endfor
  add b to Reader(X)
end procedure

procedure Subtract-CS-Write(b, X)
  for all a in CS-Reader(X) ∪ CS-Writer(X) do
    if Locks(b) ⊆ Locks(a) and not IsConcurrent(b, a) then
      delete a from CS-Reader(X) or CS-Writer(X)
  endfor
  for all a in Reader(X) do
    if not IsConcurrent(b, a) then delete a from Reader(X)
  endfor
  add b to CS-Writer(X)
end procedure

procedure Subtract-Write(b, X)
  for all a in Reader(X) ∪ CS-Reader(X) ∪ CS-Writer(X) do
    if not IsConcurrent(b, a) then
      delete a from Reader(X) or CS-Reader(X) or CS-Writer(X)
  endfor
  Writer(X) := b
end procedure

Algorithm 5.2: Subtraction for Read and Write Events with Lock Covers
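
The subtraction procedures of Algorithm 5.2 can be sketched in the same spirit. The
following Python function is an illustrative reading of the algorithm, not the thesis
implementation: an event is assumed to be a (block, lock cover) pair, the access history is
a dictionary of four sets, the function name subtract is hypothetical, and the lock test is
read as "the lock cover of b is contained in the lock cover of a".

```python
def subtract(b, h, is_write, in_cs, is_concurrent):
    """Prune events that are ordered before b from history h, then record b."""
    b_locks = b[1]

    def ordered(a):
        return not is_concurrent(b, a)

    def covered(a):
        # A critical-section event only replaces an earlier critical-section
        # event whose lock cover contains its own lock cover.
        return (not in_cs) or b_locks <= a[1]

    h["cs_reader"] = {a for a in h["cs_reader"] if not (ordered(a) and covered(a))}
    if is_write:
        h["cs_writer"] = {a for a in h["cs_writer"] if not (ordered(a) and covered(a))}
    if is_write or not in_cs:
        # Simple reader entries are pruned by every event except a CS read.
        h["reader"] = {a for a in h["reader"] if not ordered(a)}
    if is_write and not in_cs:
        h["writer"] = {b}                    # Writer(X) := b
    else:
        key = ("cs_" if in_cs else "") + ("writer" if is_write else "reader")
        h[key].add(b)
```

For example, if every earlier event is an ancestor of b, a critical-section read under
lock L removes an earlier entry whose cover is {L, M} but keeps an entry whose cover is {M},
because {L} is not contained in {M}.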

Similarly, anomalies can be missed if only a single execution instance is monitored.
Consider the program in Figure 5.6. In those execution instances in which the first iterate
to execute the critical section is not iterate 1, no anomaly occurs. However, in those
execution instances in which iterate 1 is first, there is an access anomaly between the write
of A[1] by iterate 1 in the critical section and the reads of A[1] by iterates 2 ... n outside
of the critical section.

X := 0
doall i := 1 to n
  initialize(A[i])
  lock(L)
  j := X
  X := i
  unlock(L)
  process(A[j])
endall

Figure 5.5: Program with False Anomalies

init := false
doall i := 1 to n
  j := A[1]
  lock(L)
  if not init then A[i] := j
  init := true
  unlock(L)
endall

Figure 5.6: Program with Order Dependent Anomalies

X := 0
doall i := 1 to n
  A[i] := i × X
  lock(L)
  X := X + 1
  unlock(L)
endall

Figure 5.7: Program with Order Independent Execution

Based on the preceding examples, one may (incorrectly) suspect that the unordered
representation is reliable if and only if no variable is accessed inside and outside a critical
section by concurrent threads. However, there are many cases where this simple test fails:

1. The program in Figure 5.7 can be validly modeled by the unordered representation.
However, variable X is accessed inside and outside the critical section by concurrent
threads.

2. The value of X in the program in Figure 5.8a is dependent on the execution order of
the critical section. However, the only variable accessed inside and outside critical
sections, X, is accessed by a block that is not concurrent with any critical section.

3. Similarly, the value of j in the program in Figure 5.8b is dependent on the execution
order of the critical sections even though no variable is accessed both inside and
outside the critical section.

(a)
X := 0
doall i := 1 to n
  lock(L)
  X := i - X
  unlock(L)
endall
T := X

(b)
doall i := 1 to n
  j := odd
  lock(L)
  X := X + i
  if X mod 2 = 0 then
    unlock(L)
    j := even
  else unlock(L)
endall

Figure 5.8: Two Programs with Order Dependent Execution

The remainder of this section defines when the unordered representation is reliable.

In most programs, execution within a critical section is determined in part by the
critical sections that have already executed. However, the execution of the remainder
of the program is affected only when this nondeterminism is propagated outside of the
critical sections. We can distinguish between three ways the indeterminate behavior within
critical sections can introduce nondeterminism in the rest of the program. The following
definitions are based on an outermost parallel construct C and lock L, where $C_L$ denotes
the set of critical sections in C that are covered by lock L, and $C_{\bar{L}}$ denotes the
set of blocks in C that are not covered by lock L.¹

• Parallel Nondeterminism

Definitions:

A variable X in a critical section in $C_L$ has an indeterminate value with respect
to L if and only if the series of values assigned to X is based on the execution
order of the critical sections in $C_L$.

C is parallel nondeterministic with respect to lock L if and only if there is an
indeterminate value V with respect to L and a block B in $C_{\bar{L}}$ such that:

(i) B is control dependent on V, or

(ii) V is used by B, and B is ordered with V in at least one execution instance.²

More intuitively, parallel nondeterminism occurs when the order in which concurrent
threads execute the critical sections in a parallel construct C affects the execution
within C itself. For example, the program in Figure 5.5 is parallel nondeterministic.
The critical sections are used to pass access control over the entries of array A among
the concurrent iterates. Thus, which entry of array A a thread T processes depends
on which thread entered the critical section immediately before T.

All parallel constructs that contain one type of potential access anomaly, namely a
critical section write event and a simple read event in a block that follows another
critical section, are parallel nondeterministic. However, anomalies with critical
section read events, critical section write events that conflict with read events which do
not follow another critical section, or critical section write events that conflict with
other write events will not cause the parallel construct to be identified as parallel
nondeterministic.

One may be tempted to use a stricter definition of parallel nondeterminism by
modifying condition (ii) as follows:

(ii') V is used by B, and B is ordered with V in all execution instances.

Unfortunately, when this definition is used, the interaction among the parallel
iterates is not correctly captured, and Lemma 10 (and therefore Theorem 17; see
below) does not hold.

¹ Only programs with parallel constructs are considered in this chapter; programs that have
unconstrained fork and join operations require slightly different definitions, theorems and proofs.

² If a value V and B are always concurrent, then they will be detected as an access anomaly in every
execution instance in which they both appear. Since our goal is to prove the nonexistence of access
anomalies, ignoring the effect of V on B does not introduce any circularity.

• Reference Nondeterminism

Definitions:

A variable X in a critical section in $C_L$ is an indeterminate variable with respect
to L if and only if whether X is accessed is based on the execution order of the
critical sections in $C_L$.

C is reference nondeterministic with respect to lock L if and only if an
indeterminate variable X with respect to L is accessed by a block B in $C_{\bar{L}}$, and B
and an access to X are concurrent in at least one execution instance.

A program is reference nondeterministic when the order in which threads execute the
critical sections in C affects whether or not an access A in a critical section occurs,
and A is a potential access anomaly. Reference nondeterminism is illustrated by the
program in Figure 5.6. The entry of array A which is written in the critical section
depends on which iterate executes the critical section first. Conflicting access to
A[1] occurs only when iterate 1 executes the critical section first.

• Sequential Nondeterminism

Definitions:

A variable X in a critical section in $C_L$ has an indeterminate result with respect
to L if and only if the final value of X is based on the execution order of the
critical sections in $C_L$.

C is sequential nondeterministic with respect to lock L if and only if there is a
block B that is a descendant of C and:

(i) an indeterminate value with respect to L is propagated to B, or
(ii) an indeterminate result with respect to L is used by B.

Sequential nondeterminism occurs when the execution order of the critical sections
in C affects execution after C terminates. Sequential nondeterminism is illustrated
by the program fragment in Figure 5.8a. In this program, variable X is computed in
the critical sections and is used only after the parallel construct completes execution.
The value of X depends on the critical section execution order, since subtraction is
not associative. If the subtraction operations were changed to addition operations,
this program would be sequential deterministic, since the final result of X would be
deterministic.
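
The order dependence in Figure 5.8a can be checked by brute force. The Python sketch
below enumerates every order in which the iterates could execute the critical section and
collects the final values of X. It assumes that the critical section body in Figure 5.8a is
X := i - X with X initially 0; the function name final_values is hypothetical.

```python
from itertools import permutations


def final_values(n, update):
    """Final X for every order in which iterates 1..n run the critical section."""
    results = set()
    for order in permutations(range(1, n + 1)):
        x = 0                      # X := 0 before the doall
        for i in order:
            x = update(x, i)       # body of the critical section for iterate i
        results.add(x)
    return results


subtraction = final_values(3, lambda x, i: i - x)   # X := i - X
addition = final_values(3, lambda x, i: x + i)      # X := X + i
```

For n = 3 the subtraction version yields more than one possible final value, so the result
used after the construct is indeterminate, whereas the addition version yields a single
value in every order.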

These three types of nondeterminism are orthogonal: the conditions under which each
occurs, the importance of their detection, and their detection algorithms are independent.

The concepts of parallel, reference and sequential nondeterminism are used to define
when the unordered representation is sufficient to model a set of concurrent critical
sections.

Theorem 17: If a parallel construct C is parallel, reference and sequential
deterministic with respect to a lock L, then the unordered representation meets
Reliability Properties 1 and 2.

Theorem 18: If a parallel construct C is parallel and reference deterministic
with respect to a lock L, then the unordered representation meets Reliability
Properties 1 and 2 within C and Reliability Property 1 outside C.

A direct consequence of Theorem 17 is that one execution instance is sufficient to guarantee
that a POEG deterministic program and input vector pair (P, I) is anomaly free if P
contains critical section coordination and is parallel, sequential and reference deterministic
with respect to all locks. More generally, whenever a set of critical sections is parallel,
reference and sequential deterministic, its execution order can be ignored in any program
analysis or debugging.

Theorem 18 states that the anomaly detection within a parallel construct C is
independent of subsequent sequential nondeterminism. If the unordered representation is
used for a parallel construct C that is parallel and reference deterministic but sequential
nondeterministic, no false anomalies will be reported anywhere in the program. However,
anomalies may be hidden in a given execution instance.

Lemma 10 In the absence of access anomalies detected when the ordered representation
is used, if a parallel construct C is parallel deterministic with respect to L, then the
execution of blocks in $C_{\bar{L}}$ is independent of the execution order of the critical
sections in $C_L$.

Proof. Suppose, for the sake of contradiction, this were not true for an execution instance
E and the execution of a block in $C_{\bar{L}}$ is dependent on the execution order of the
critical sections in $C_L$. Information about the execution order of the critical sections in
$C_L$ must be conveyed to some block in $C_{\bar{L}}$. In the absence of access anomalies, the
only accesses within critical sections in $C_L$ that are dependent on the execution order of
the critical sections are indeterminate variables, values and results (by definition).

Information can be conveyed during the execution of a program only by modifying the
control flow or data flow of the program. Both of these cases are addressed below:

• Since C is parallel deterministic with respect to L, no block in $C_{\bar{L}}$ is control
dependent on an indeterminate value in a critical section in $C_L$. Therefore, no information
is passed by control flow.

• Since C is parallel deterministic with respect to L, any indeterminate value D with
respect to lock L that reaches a use U in $C_{\bar{L}}$ is concurrent with U in all execution
instances. If D reaches U in E, an access anomaly would have been detected;
otherwise, no information about the execution order is conveyed (D could occur
after U is used). Therefore, no information is passed by data flow.

In the absence of access anomalies, no information about the execution order of the critical
sections in $C_L$ is conveyed to a block in $C_{\bar{L}}$ through control or data flow. Hence the
execution of every block in $C_{\bar{L}}$ must be independent of the execution order. □

Lemma 11 With all other sources of nondeterminism fixed, if a parallel construct C
is parallel and reference deterministic with respect to lock L, and an access anomaly is
detected in at least one execution instance when the ordered representation is used, then
an anomaly is guaranteed to be detected in every execution instance when the unordered
representation is used.

Proof. If an anomaly occurs during an execution instance E when the unordered
representation is used, the proof is complete. Otherwise, suppose for the sake of contradiction
that an anomaly is not detected in E when the unordered representation is used, but is
detected in another execution instance E'.

Since C is parallel deterministic for L, by Lemma 10 the actions of each block in
$C_{\bar{L}}$ are independent of the execution order of the critical sections in $C_L$. In particular,
when all other sources of nondeterminism are fixed, the variables read and the variables
written by every block in $C_{\bar{L}}$ are the same in E and E' up to and including the first access
anomaly. Since C is reference deterministic, every variable in a critical section in $C_L$ that
is a potential access anomaly and is read (resp. written) in E is also read (resp. written)
in E'. Thus, each block in C accesses the same set of variables in E and E'.

Since the unordered representation does not modify the POEG, and all other sources
of nondeterminism are fixed, the ancestor relationship is identical for E and E'. Therefore,
if an anomaly is detected in E', an anomaly must also be detected in E. □

Lemma 12 If a parallel construct C is parallel and reference deterministic with respect to
L, then anomalies will not be falsely reported when the unordered representation is used.

Proof. Suppose, for the sake of contradiction, that an anomaly is falsely reported between
blocks $b_i$ and $b_j$ in an execution instance E in which the critical sections in $C_L$ are
modeled by the unordered representation. Assume $b_i$ happened to execute before $b_j$ in E.
(A symmetric argument holds if $b_i$ executed after $b_j$.) Since there is no anomaly detected
when the ordered representation is used, block $b_i$ (or one of its descendants) is a critical
section $cs_i$ in $C_L$ and $b_j$ (or one of its ancestors) is a critical section $cs_j$ in $C_L$ such
that $cs_i$ executed immediately before $cs_j$ in E.

Let E' be an execution instance with all sources of nondeterminism fixed in the same
way as E except that $cs_i$ executes immediately after $cs_j$ in E'. This execution order must
be able to occur since $cs_j$ and $cs_i$ are concurrent in the unordered representation of E.
There are three different relationships between $cs_i$, $cs_j$, $b_i$ and $b_j$ in E' when the ordered
representation is used (Figure 5.9 shows a partial POEG for the ordered representation
of E'):

1. If $cs_i \neq b_i$ then $b_i$ and $b_j$ are concurrent, since $cs_j$ and all of its descendants are
concurrent with all of the ancestors of $cs_i$.

2. If $cs_i = b_i$ and $cs_j \neq b_j$ then $b_i$ and $b_j$ are concurrent, since $cs_i$ is concurrent with
all of the descendants of $cs_j$.

3. If $cs_i = b_i$ and $cs_j = b_j$ then no anomaly could have been reported, since two
accesses in critical sections covered by the same lock never conflict.

Figure 5.9: Partial POEG for E'

In all cases $b_i$ and $b_j$ are concurrent in the ordered representation of E'. By the same
logic as the proof of Lemma 11, $b_i$ and $b_j$ access the same set of variables in E and E'.
Therefore, an anomaly between $b_i$ and $b_j$ will also be detected in E' and hence is not a
false anomaly. □

Lemma 13 In the absence of access anomalies, if a parallel construct C is sequential
deterministic with respect to L, then the program state after the execution of C is
independent of the execution order of the critical sections covered by lock L in C.

Proof. Suppose, for the sake of contradiction, that the program state after the execution
of C were dependent on the execution order of critical sections in $C_L$ and neither condition
(i) nor condition (ii) in the definition of sequential nondeterminism was met. Because C is
a parallel construct, the code that is executed on the exit of C cannot be control dependent
on a condition in C.

There are two other cases to consider:

1. If C is parallel deterministic, then by Lemma 10 there are no values computed in
the blocks in $C_{\bar{L}}$ that are affected by the execution order of the critical sections in
$C_L$. Therefore, if a result value in a critical section in $C_L$ is not propagated outside
of the construct (i.e. condition (ii) is not met), the execution of subsequent code
cannot be dependent on the execution order of critical sections in $C_L$.

2. If C is parallel nondeterministic, by Lemma 10 the only additional values computed
in C that are affected by the execution order of the critical sections in $C_L$ are
those that are propagated from a use that meets condition (i). Therefore, if neither
condition (i) nor (ii) is met, the code following C cannot be dependent on the
execution order of critical sections in $C_L$.

Hence, the program state after the execution of the construct must be independent of the
execution order. □

Theorem 17 If a parallel construct C is parallel, reference and sequential deterministic
with respect to a lock L, then the unordered representation meets Reliability Properties 1
and 2.

Proof. Lemma 11 and Lemma 12 guarantee that the unordered representation will
reliably detect access anomalies within C if C is parallel and reference deterministic with
respect to L. Lemma 13 proves that the remainder of the execution is independent of
the execution order of the critical sections in $C_L$ if C is sequential deterministic with
respect to L. Therefore, the unordered representation correctly captures the coordination
stemming from the critical sections covered by L in C. □

Theorem 18 If a parallel construct C is parallel and reference deterministic with respect
to a lock L, then the unordered representation meets Reliability Properties 1 and 2 within
C and Reliability Property 1 outside C.

Proof. Reliability within C follows directly from Lemma 11 and Lemma 12. In addition,
no false anomalies can be reported in code following C since it is not concurrent with any
access in C. □

5.3     Detecting Parallel, Reference and Sequential Nondeterminism

This section presents algorithms that use compiler-based static analysis techniques to
detect sequential, parallel and reference nondeterminism. As in most static analysis,
there is a trade-off between efficiency and an exact answer. Several assumptions are made
that greatly reduce the amount of work performed at the expense of increased inaccuracy.
The assumptions err on the safe side: every nondeterministic parallel construct will be
detected. However, a parallel construct that is sequential deterministic, for example, may
be identified as sequential nondeterministic.

The first step in detecting parallel, reference and sequential nondeterminism is to
determine how computation within critical sections is propagated to the remainder of
the program. The portion of the code that is reached during this propagation step is
the only part of the program that is potentially affected by execution within the critical
sections. Once this information is obtained, we check individually for parallel, sequential,
and reference nondeterminism. The general outline for these algorithms is given below.

Suppose we want to test a parallel construct for a given type of nondeterminism, for
instance, sequential nondeterminism. We first check if a necessary reaching condition is
met:

Reaching Condition. Values computed within a critical section are
propagated outside of the critical section.

If the reaching condition is not met, the parallel construct is sequential deterministic, since
there is no interaction between the critical sections and the remainder of the program.

Otherwise, each access that meets the reaching condition is further analyzed to see if
a necessary order dependency condition is met:

Order Dependency Condition. Values propagated outside critical sections
are dependent on the order in which critical sections are executed.

Evaluating order dependency determines whether or not the propagated value is
dependent on the execution order of the critical sections. The parallel construct is guaranteed
to be sequential deterministic if either the reaching or the order dependency condition is
not met.

Before describing the algorithms, we briefly define the data structures used by the
static analysis algorithms: the parallel control flow graph and program dependence graph.
To be compatible with current compiler optimization literature, read events are referred
to as uses and write events are referred to as definitions. In the remainder of this section,
the terms block, critical section and parallel, reference and sequential nondeterminism are
defined with respect to a given lock L. However, for ease of exposition, the phrases "not
covered by lock L", "covered by lock L" and "with respect to lock L" will not be stated
explicitly.

Parallel Control Flow Graph

A parallel control flow graph represents an explicitly parallel program much in the
same way that a sequential control flow graph represents a sequential program. Nodes
in the parallel control flow graph correspond to sequences of instructions with one entry
point, one exit point and no internal fork, join and coordination operations. Examples
of parallel control flow graphs include the synchronized control flow graph defined by
Callahan and Subhlok [CS88] and the annotated flow graph of Taylor [Tay83,TO80]. The
following algorithms use two types of information obtained from a parallel control flow
graph: nodes that can potentially execute concurrently, and the set of definitions that
reach a given use.

It is difficult to compute the concurrency relationship exactly. One way to approximate
the concurrency relationship is by computing the SCPreserved set of Callahan and Subhlok
[CS88,CKS90]. A node n is in SCPreserved(x) only if n precedes x in every execution
instance in which they both appear. Using SCPreserved sets, the concurrency relationship
is computed as follows:

$\mathrm{Concur}(n) = \{\, x : x \notin \mathrm{SCPreserved}(n) \wedge n \notin \mathrm{SCPreserved}(x) \,\}$

This equation is conservative: a node x is in Concur(n) if x and n are concurrent. However,
a node y which is never concurrent with n might also be in Concur(n).
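
The equation for Concur can be evaluated directly once the SCPreserved sets are available.
The Python sketch below assumes the sets are given as a dictionary from each node to the
set of nodes known to precede it. The function name concur and the four-node fork and join
example are illustrative, and the sketch omits a node from its own Concur set, which the
equation as written does not do.

```python
def concur(nodes, sc_preserved):
    """Concur(n) = {x : x not in SCPreserved(n) and n not in SCPreserved(x)}."""
    return {
        n: {
            x
            for x in nodes
            # Assumption: a node is not listed as concurrent with itself.
            if x != n and x not in sc_preserved[n] and n not in sc_preserved[x]
        }
        for n in nodes
    }


# Entry node e forks two branches a and b, which join at node j.
nodes = ["e", "a", "b", "j"]
sc_preserved = {"e": set(), "a": {"e"}, "b": {"e"}, "j": {"e", "a", "b"}}
concurrent = concur(nodes, sc_preserved)
```

In this example only the two branch nodes are reported as potentially concurrent. Because
SCPreserved may omit orderings that always hold, the resulting sets can only be too large,
never too small, which matches the conservative behavior described above.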

In parallel programs there are two types of definitions that can reach a given use:

$Reach_S(u)$: the definitions that reach use u from nodes that are ordered with u in
at least one execution instance.

$Reach_C(u)$: the definitions that reach use u from nodes concurrent with u in at
least one execution instance.

(Note that a definition can be in both the $Reach_C$ and $Reach_S$ sets of a given use.)
Only the definitions in the $Reach_S$ set are required in detecting parallel, sequential and
reference nondeterminism.

$Reach_S$ sets are calculated from the parallel control flow graph in the same manner
as in sequential data flow analysis (described in [AU77]). However, an extended notion
of predecessor is used to correctly model critical section coordination. In any execution
instance, a value that reaches the end of a critical section instance can be safely used by
the subsequent critical section instance. Therefore, a node p that is concurrent with a
node n is treated as an additional predecessor of n if three conditions hold: (i) p and n
are covered by the same lock L, (ii) p has successors not covered by L, and (iii) n has
predecessors not covered by L.
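
The extended notion of predecessor can be sketched as follows. This Python fragment is
only an illustration of the three conditions for a single lock L: the graph is assumed to
be given as predecessor and successor dictionaries, covered is the set of nodes covered by
L, concurrent maps a node to the nodes it may run concurrently with, and all names are
hypothetical.

```python
def extended_predecessors(n, preds, succs, covered, concurrent):
    """Flow predecessors of n plus concurrent critical sections that may precede it."""

    def leaves(p):
        # (ii) p has a successor not covered by L, so p ends a critical section.
        return any(s not in covered for s in succs[p])

    def enters(m):
        # (iii) m has a predecessor not covered by L, so m starts a critical section.
        return any(q not in covered for q in preds[m])

    extra = set()
    if n in covered and enters(n):
        # (i) p and n are both covered by L.
        extra = {p for p in concurrent[n] if p in covered and leaves(p)}
    return set(preds[n]) | extra


# Two threads, each running: a_k (before), c_k (critical section), z_k (after).
preds = {"a1": set(), "c1": {"a1"}, "z1": {"c1"},
         "a2": set(), "c2": {"a2"}, "z2": {"c2"}}
succs = {"a1": {"c1"}, "c1": {"z1"}, "z1": set(),
         "a2": {"c2"}, "c2": {"z2"}, "z2": set()}
covered = {"c1", "c2"}
concurrent = {"a1": {"a2", "c2", "z2"}, "c1": {"a2", "c2", "z2"}}
```

Here the critical section c1 gains c2 as an additional predecessor, so definitions that
reach the end of c2 are propagated into c1 by an ordinary reaching definitions pass, while
the uncovered node a1 keeps only its flow predecessors.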

Program Dependence Graph

The second data structure used is the program dependence graph proposed by Ferrante,
Ottenstein and Warren [FOW87]. Program dependence graphs allow for simpler
representation of certain programming constructs and have proved invaluable in
many different uses in static analysis of programs, ranging from parallelization
[ABC+88],[BCF+88],[BHRB89] and program analysis [Sel89] to creating program slices
[HPR87],[HRB88]. Every node in a program dependence graph corresponds to a node in
the parallel control flow graph. There are data dependence and control dependence edges
between nodes.

There is a control dependence edge from node n to node c with label T if c (eventually)
executes whenever the predicate at the end of n has value T. An array reference A[i]
is represented in the control dependence graph as being control dependent on the index
variable i. The initial basic block of a concurrent thread has the same control dependence
relationship as its associated fork operation. Every program dependence graph has an
embedded control dependence graph, which consists of control dependence edges, and a
forward control dependence graph [CHH89], which is the control dependence graph with
all back edges removed.

In contrast to control flow graphs, data dependences are explicitly represented in a
program dependence graph. There is a data dependence edge in a program dependence
graph to a use u from all definitions in $Reach_S(u)$. (A definition that reaches a use u but
is always concurrent with u is not represented by a data dependence edge in the program
dependence graph.) Several algorithms exist for building a program dependence graph
[FOW87,CFR+89].

To illustrate the structure of a program dependence graph, consider the program
fragment and associated program dependence graph in Figure 5.10. Data dependences
are denoted in the program dependence graph by dashed lines and control dependences
by solid lines; the control dependence labels are shown in bold face. In this program
dependence graph, every node is a single statement. Node K is control dependent on D
because it executes whenever the predicate at the end of node D has value T. Node C is
data dependent on A because the use of j in C is reached by the definition of j in A.

parallel case
  j := f(1)
  lock(L)
  X := X + j
  if X = 2 then
    unlock(L)
    j := 2
  else unlock(L)
  endif
  k := j × 2
parallel
  lock(L)
  X := X + f(2)
  unlock(L)
end parallel case
y := X

Figure 5.10: Program Dependence Graph

5.3.1     Propagating  Critical  Section  Indeterminacy 

Our first goal is to identify all potentially indeterminate behavior within critical sections,
and see how this affects the remainder of the program. In doing so we make the first
simplifying assumption:

Assumption 1. Every definition in a critical section has an indeterminate value.

It is possible for a value V computed within a critical section to be independent of the
critical section execution order. However, since the amount of sequentialization incurred
by a critical section is linear in the size of the code, actions performed within the critical
section are generally directly dependent on the current state of the variables protected by
the critical section. Otherwise, the computation of V could be moved outside the critical
section, thereby improving the code.

We propagate information about critical section indeterminate computation using
program slices. A slice of a program is defined with respect to some program point P and
variables X and consists of all of the definitions and predicates of the program that might
affect the value of X at point P [Wei84]. Similarly, a forward slice of a program is defined
with respect to some program point P, and consists of all points and variables that might
be affected by the computation at P. Thus, a ∈ slice(b) if and only if b ∈ forward-slice(a).
By computing forward slices we identify all code that is possibly affected by
computation at some step in the program.
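The duality between slices and forward slices can be checked on a small example. The
following is a minimal sketch, assuming the dependences are given as a plain list of
(source, target) edges; the edge list DEPS and the function names are hypothetical and
are not taken from the thesis.

```python
from collections import defaultdict

def reachable(start, edges):
    """Return the nodes reachable from start by following one or more edges."""
    succ = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def forward_slice(a, deps):
    """Everything that might be affected by the computation at a."""
    return reachable(a, deps)

def backward_slice(b, deps):
    """Everything that might affect b (the same search on reversed edges)."""
    return reachable(b, [(dst, src) for src, dst in deps])

# Hypothetical dependence edges: (x, y) means "x may affect y".
DEPS = [("A", "C"), ("C", "D"), ("D", "K"), ("K", "E"), ("C", "I"), ("G", "I")]
NODES = sorted({n for edge in DEPS for n in edge})
```

For every pair of nodes a and b in this sketch, a is in backward_slice(b) exactly when b is
in forward_slice(a).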

We create a forward slice from each definition within the critical section by traversing
the program dependence graph, following the data dependence and control dependence
edges. If a node n is reached on the traversal, then every predicate and definition within
n is part of the forward slice. Therefore, a program dependence graph in which each
node is as small as possible (for example, a single statement rather than a sequence of
statements) allows more accurate results to be computed.

During the traversal, every node in the program dependence graph has one of three
states associated with it: deterministic, indeterminate or nondeterministic. Initially, all
nodes in the program dependence graph that are in critical sections and contain definitions
are marked indeterminate (based on Assumption 1). If a critical section node n is reached
during the traversal by only other critical section nodes, the state of n is indeterminate.
Otherwise, if a node n is reached during the traversal, its state is nondeterministic. The
distinction between indeterminate and nondeterministic nodes is important when detecting
sequential and reference nondeterminism. For instance, if the state of the root node of
a critical section A is nondeterministic, then whether or not A executes at all in a given
execution instance can depend on the execution order of preceding critical sections.

Routine Forward-Slice in Algorithm 5.3 is used to propagate potential indeterminacy
throughout the program. The amount of work required by routine Forward-Slice is the cost
of traversing the program dependence graph three times, O(|PDG_C|). It follows directly
from results by Ottenstein and Ottenstein [OO84] that every potentially indeterminate
node will be detected by creating forward slices.

procedure Forward-Slice(C, L)
    for all n ∈ PDG_C do
        if L ∈ Locks(n) and Def(n) ≠ ∅ then
            state(n) := indeterminate
        else state(n) := deterministic
    endfor
    for all n ∈ PDG_C such that L ∈ Locks(n) and Def(n) ≠ ∅ do
        Traverse(n)
    endfor
end procedure

procedure Traverse(n)
    for all children c of n do
        if state(c) < state(n) then
            if L ∉ Locks(c) then
                state(c) := nondeterministic
            else state(c) := state(n)
            Traverse(c)
        endif
    endfor
end procedure

{ Note: deterministic < indeterminate < nondeterministic }

Algorithm 5.3: Propagating Potential Indeterminacy

To illustrate the propagation of indeterminacy, consider the program dependence
graph in Figure 5.10. Before the traversal, nodes C and G are marked indeterminate
since they are critical section nodes and they contain definitions. After the traversal,
the set of indeterminate nodes is {C,D,G,J,L}; the set of nondeterministic nodes is
{K,E,I}. Node K is nondeterministic since it is control dependent on D, node E is
nondeterministic since it is data dependent on K, and node I is nondeterministic since it
is data dependent on nodes G and C.
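The propagation can be sketched in a few lines of Python. This is an illustrative
sketch, not the thesis implementation: the dictionary encoding is hypothetical, and the
edge set below is an assumed reading of Figure 5.10 (in particular, nodes J and L are
assumed to be the two unlock statements controlled by D).

```python
DET, INDET, NONDET = 0, 1, 2   # deterministic < indeterminate < nondeterministic

def forward_slice_states(children, in_lock, has_def):
    """Propagate potential indeterminacy over a program dependence graph.

    children[n] lists the nodes that are data or control dependent on n.
    in_lock[n] is True when n lies in a critical section on the lock.
    has_def[n] is True when n contains a definition.
    """
    state = {n: INDET if in_lock[n] and has_def[n] else DET for n in children}

    def traverse(n):
        for c in children[n]:
            if state[c] < state[n]:
                # Outside the critical section the effect becomes nondeterministic;
                # inside it, the child inherits the state of the node reaching it.
                state[c] = state[n] if in_lock[c] else NONDET
                traverse(c)

    for n in children:
        if in_lock[n] and has_def[n]:
            traverse(n)
    return state

# Assumed dependence edges for the fragment in Figure 5.10.
CHILDREN = {
    "A": ["C", "E"], "C": ["D", "G", "I"], "D": ["J", "K", "L"], "E": [],
    "G": ["C", "D", "I"], "I": [], "J": [], "K": ["E"], "L": [],
}
IN_LOCK = {n: n in {"C", "D", "G", "J", "L"} for n in CHILDREN}
HAS_DEF = {n: n in {"A", "C", "E", "G", "I", "K"} for n in CHILDREN}

STATE = forward_slice_states(CHILDREN, IN_LOCK, HAS_DEF)
```

Under these assumptions the sketch marks C, D, G, J and L as indeterminate and K, E and I
as nondeterministic, which agrees with the sets given above.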

5.3.2     Detecting  Parallel  Nondeterminism 

We can re-formulate the definition of parallel nondeterminism in terms of reaching and
order dependent conditions:

Reaching Condition: There is a use U in a block in C, a definition or
predicate D in a critical section in C, and D ∈ Reachs(U) or U is control
dependent on D.

Order  Dependency  Condition:  D  has  an  indeterminate  value. 

Clearly, if there are no uses in a parallel construct C that are reached by critical
section definitions or dependent on critical section predicates, C is parallel deterministic.
However, even when the reaching condition is met, the propagation of critical section
indeterminate behavior makes the detection of parallel nondeterminism very simple.

Because of Assumption 1, every indeterminate definition within C is assumed to be an
indeterminate value. Thus, the algorithm for detecting parallel nondeterminism is trivial.
Every access A that is in a nondeterministic node in C is parallel nondeterministic. The
parallel construct C is parallel nondeterministic if there is a parallel nondeterministic
node that is not in a critical section.

To illustrate the detection of parallel nondeterminism, consider the parallel
nondeterministic program and associated program dependence graph in Figure 5.10. Nodes K and
E are parallel nondeterministic, since they are nondeterministic and in the parallel
construct C. Moreover, because they are not part of a critical section, the parallel construct
is identified as parallel nondeterministic. (Node I is not parallel nondeterministic since it
is not in C.)
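Once node states are known, the test is a simple filter. The sketch below assumes the
states have already been computed by the forward slice traversal; the dictionaries and
function names are hypothetical, and the node data is an assumed reading of Figure 5.10.

```python
def parallel_nondeterministic_nodes(state, in_construct):
    """Nodes that are nondeterministic and lie inside the parallel construct C."""
    return {n for n, s in state.items()
            if s == "nondeterministic" and in_construct[n]}

def construct_is_parallel_nondeterministic(state, in_construct, in_critical_section):
    """C is parallel nondeterministic if such a node is outside every critical section."""
    return any(not in_critical_section[n]
               for n in parallel_nondeterministic_nodes(state, in_construct))

# Assumed node data for Figure 5.10 (node I is the statement after the construct).
STATE_5_10 = {
    "A": "deterministic", "C": "indeterminate", "D": "indeterminate",
    "E": "nondeterministic", "G": "indeterminate", "I": "nondeterministic",
    "J": "indeterminate", "K": "nondeterministic", "L": "indeterminate",
}
IN_CONSTRUCT = {n: n != "I" for n in STATE_5_10}
IN_CRITICAL = {n: n in {"C", "D", "G", "J", "L"} for n in STATE_5_10}
```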

5.3.3      Detecting  Sequential  Nondeterminism 

Similar to parallel nondeterminism, we can formulate the definition of sequential
nondeterminism in terms of reaching and order dependent conditions as follows:

Reaching Condition: There is a use U that is a descendant of C, a
definition D ∈ Reachs(U), and D is in C.

Order Dependency Condition: Either (1) D is in a critical section and is
an indeterminate result, or (2) D is parallel nondeterministic.

Clearly, if there is no use outside of the parallel construct C that is reached by a
definition within C, the parallel construct is sequential deterministic. Moreover, any use that
meets order dependence condition 2 is pessimistically assumed to be sequentially
nondeterministic (based on Assumption 1). If neither case holds, further analysis is required to
determine if order dependent condition 1 is met.

In contrast to indeterminate values, definitions within critical sections are often
determinate results. Consider, for example, a program for computing a maximum in parallel.
Although the value of the maximum at any intermediate step is indeterminate, the final
result is always the same.

Unfortunately, determining whether or not a definition is an indeterminate result can
be quite expensive. All possible critical section execution orders and all paths through each
critical section must be analyzed. (Since D is not parallel nondeterministic, all variables
that are used to compute D have the same initial value in any execution instance and the
same set of critical sections are guaranteed to execute in all execution instances.) If there
are N concurrent critical sections with P paths per critical section, (P^N)! execution
orderings must be analyzed. This is clearly computationally intractable.

Rather than examine all execution orders, the following simplifying assumption is
made:

Assumption 2. A definition D is an indeterminate result if the value of D is
dependent on the execution order of at least one pair of critical section instances.

D is a determinate result if the value of D after the execution of every pair of critical
sections is independent of their execution order. When this property holds, the execution
order of any two adjacent critical sections can be exchanged in any sequentialization of
the critical sections without changing the overall result of the execution. Therefore, an
arbitrary execution order can be produced by performing a series of exchanges without
changing the final result.
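The exchange argument can be illustrated with a small experiment. In this sketch each
critical section is modeled as a Python function on a single shared value, which is an
assumption made only for illustration. The pairwise test is run on sample values rather
than symbolically, and the brute-force comparison of all orders is feasible only for a
handful of sections.

```python
from itertools import combinations, permutations

def pairwise_commute(sections, samples):
    """True if every pair of sections gives the same result in both orders."""
    return all(a(b(x)) == b(a(x))
               for a, b in combinations(sections, 2)
               for x in samples)

def all_orders_agree(sections, x):
    """Brute-force check: run every execution order starting from value x."""
    results = set()
    for order in permutations(sections):
        value = x
        for section in order:
            value = section(value)
        results.add(value)
    return len(results) == 1

def running_max(v):
    """Critical section that updates a shared maximum with the local value v."""
    return lambda x: max(x, v)

MAX_SECTIONS = [running_max(v) for v in (3, 9, 4)]   # the parallel maximum example
MIXED_SECTIONS = [lambda x: x + 1, lambda x: x * 2]  # increment versus doubling
```

The maximum sections commute pairwise, so every order yields the same final value; the
increment and doubling sections do not commute, and the orders disagree.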

Assumption 2 reduces the number of critical section execution orders that must be
analyzed to O(PN^2). This overhead is further mitigated by two factors. First, critical
sections are generally quite small. Thus, there is little code to generate and very few paths
to consider. Second, in homogeneous parallelism, such as that stemming from a doall
operation, the code for all critical sections is identical for all parallel threads. Therefore,
a pair of generic instances can be analyzed to determine if the order dependence criterion
is met.

To evaluate order dependence, each pair of critical sections A_i and A_j is analyzed
using symbolic execution [CR84]. Two sequential instruction sequences are created for
each pair: one in which A_i is followed by A_j and another in which A_j is followed by A_i.
For each path through the critical sections, the two instruction sequences are symbolically
executed. For each path, the pair of final values of definition D are converted to a canonical
form and are verified to be computationally identical. In general, a program is sequential
deterministic only if the operations performed in the critical section are associative and
commutative.

To illustrate the detection of sequential nondeterminism, consider the program shown
in Figure 5.10. In this program the critical sections compute the sum f(1) + f(2). Only
node I meets the reaching condition. Since node I does not meet order dependent
condition 2, order dependence condition 1 must be evaluated.

To  do  so  the  following  two  instruction  sequences  are  created: 

A1 || A2:                A2 || A1:

X := X + j1              X := X + f(2)

X := X + f(2)            X := X + j1

Using symbolic execution, the final values of the variable X for these two instruction
sequences are computed to be the following:

X := (X + j1) + f(2)     X := (X + f(2)) + j1

These  two  expressions  are  computationally  equivalent.  Therefore,  the  above  program  is 
sequential  deterministic. 
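The comparison can be sketched for a deliberately restricted statement form. The sketch
below assumes that every statement is a sum of operands, so that a multiset of atoms (a
Counter) serves as the canonical form; it handles a single path only, and the encoding
and names are hypothetical rather than the symbolic executor of [CR84].

```python
from collections import Counter

def execute(sequence, variables):
    """Symbolically run assignments of the form target := sum(operands).

    Each symbolic value is a Counter mapping atoms to multiplicities, which is
    a canonical form for sums (addition is associative and commutative).
    """
    env = {v: Counter({v + "0": 1}) for v in variables}   # X starts as atom X0
    for target, operands in sequence:
        total = Counter()
        for term in operands:
            # A program variable contributes its current symbolic value;
            # any other operand is treated as an opaque atom.
            total.update(env[term] if term in env else Counter({term: 1}))
        env[target] = total
    return env

def order_dependent(a1, a2, result, variables):
    """True if running a1 then a2 differs from a2 then a1 for the given variable."""
    first = execute(a1 + a2, variables)[result]
    second = execute(a2 + a1, variables)[result]
    return first != second

A1 = [("X", ["X", "j1"])]       # X := X + j1
A2 = [("X", ["X", "f(2)"])]     # X := X + f(2)
DOUBLE = [("X", ["X", "X"])]    # X := X + X, a hypothetical third section
```

Here A1 and A2 produce the same canonical sum in either order, while pairing A1 with the
doubling section gives different results.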

5.3.4      Detecting  Reference  Nondeterminism 

Reference  nondeterminism  is  the  most  complicated  type  of  nondeterminism  to  detect.  We 
can  formulate  the  definition  of  reference  nondeterminism  in  terms  of  reaching  and  order 
dependent  conditions  as  follows: 

Reaching Condition: There is an access A_b of variable X in a block in C,
an access A_c of variable X in a critical section in C, and A_b or A_c is a
definition.

Order Dependency Condition: X is an indeterminate variable.

Clearly, C is reference deterministic if either: (i) there are no potential anomalies with
critical section accesses, or (ii) there is no conditional execution within critical sections.
Otherwise, a more sophisticated analysis is required.

To evaluate the reaching condition, the references in critical sections that are potential
access anomalies are isolated. In particular, the following two sets are computed for each
critical section A:

Pot_R(A) : the variables read in A that are potential anomalies
Pot_W(A) : the variables written in A that are potential anomalies

Static analysis can be used to detect potential anomalies [CS88], or Pot_R(A) and Pot_W(A)
can be conservatively assumed to contain every variable accessed in A.

To evaluate order dependence, each critical section is analyzed to determine if the
same set of potentially anomalous variables are read and written regardless of the critical
section execution order. To do so, the following sets are computed for each critical section
node n in A:

May_R(n) : the variables in Pot_R(A) with read events control dependent on n
Must_R(n) : the variables in Pot_R(A) read in every execution that includes n

The algorithm for computing the May_R and Must_R sets, shown in Algorithm 5.4, uses a
forward control dependence graph. (The definitions and algorithms for the May_W and Must_W
sets are similar to those for computing May_R and Must_R but use the Definition set.) It
is shown in [CHH89] that a forward control dependence graph for a reducible program is
a dag. Therefore, the May and Must sets can be computed by performing forward and
backward topological traversals of the forward control dependence graph.

procedure Compute-May_R(A)
    for all nodes n in a backward topological traversal of the FCDG_A do
        May_R(n) := Use(n) ∩ Pot_R(A)
        for all children c of n do
            May_R(n) := May_R(n) ∪ May_R(c)
            if c is a True child of n then
                True(n) := True(n) ∪ Both(c)
            else False(n) := False(n) ∪ Both(c)
        endall
        Both(n) := (True(n) ∩ False(n)) ∪ Def(n)
    endfor
end procedure

procedure Compute-Must_R(A)
    for all nodes n in a forward topological traversal of the FCDG_A do
        Must_R(n) := Use(n) ∩ Pot_R(A)
        for all parents p of n do
            if n is a True child of p then
                Must_R(n) := Both(n) ∪ Must_R(p) ∪ True(p)
            else Must_R(n) := Both(n) ∪ Must_R(p) ∪ False(p)
        endall
    endall
end procedure

Algorithm 5.4: Computing May_R and Must_R

Reference nondeterminism is detected by verifying that every variable in May_R(n) is
also in Must_R(n) for all nodes n in A that are not deterministic. In other words, we check
to see if there is a potential access anomaly that occurs for some value of the predicate
of n but is not guaranteed to occur in every execution that includes n. The May_W and
Must_W sets are defined and used symmetrically.

There are two special cases that must be considered: (1) the root block of critical
section A is parallel nondeterministic, and (2) critical section A has internal coordination
nodes. In case 1, critical section A may never be executed at all; in case 2, the two accesses
may have different concurrency relationships. Both cases are handled by conservatively
setting Must_R(n) to be empty for all n in A. The routine in Algorithm 5.5 is performed
for each critical section A to check for reference nondeterminism.

The complexity of detecting reference nondeterminism is five traversals of the control
dependence graph times the number of shared variables. Theorem 19 proves that
Check-Reference will not miss any reference nondeterministic constructs.

procedure Check-Reference(A)
    Compute-May_R(A)
    Compute-May_W(A)
    if state(root(A)) = deterministic and there is no coordination in A then
        Compute-Must_R(A)
        Compute-Must_W(A)
    endif
    for all n ∈ A such that state(n) ≠ deterministic do
        if May_W(n) ⊈ Must_W(n) or May_R(n) ⊈ Must_R(n) then Reference Nondeterministic
    endfor
end procedure


Algorithm 5.5: Detecting Reference Nondeterminism


[Control dependence graph: Entry; lock(L); predicate I with T child X := 2 and
F child X := 3; predicate D with T child X := 4 and F child Y := 1; unlock(L).]

Critical section A
lock(L)
if (I) then
    X := 2
else X := 3
if (D) then
    X := 4
else Y := 1
unlock(L)


Figure 5.11: Reference Deterministic Program

To illustrate the detection of reference nondeterminism, consider the control
dependence graph for the critical section shown in Figure 5.11. Suppose that Pot_W(A) =
{X,Y}, D is a deterministic predicate, and I is an indeterminate predicate. May_W(I)
and Must_W(I) are both {X}, and May_W(D) and Must_W(D) are {X,Y} and {X}
respectively. Because May_W(I) ⊆ Must_W(I), A will be correctly identified as reference
deterministic. Since D is deterministic, the fact that May_W(D) ⊈ Must_W(D) does not
affect the detection of reference nondeterminism.
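The May and Must sets for this example can be reproduced directly from their
definitions. The sketch below is not Algorithm 5.4: it assumes a hypothetical nested
encoding of the critical section, enumerates every path by brute force, handles only
written variables, and ignores the two special cases (a nondeterministic root block and
internal coordination).

```python
from itertools import product

# Hypothetical statement encoding:
#   ("write", var)                       a definition of var
#   ("if", pred, then_body, else_body)   a predicate node named pred

def predicates(body):
    names = []
    for stmt in body:
        if stmt[0] == "if":
            names.append(stmt[1])
            names += predicates(stmt[2]) + predicates(stmt[3])
    return names

def all_writes(body):
    out = set()
    for stmt in body:
        if stmt[0] == "write":
            out.add(stmt[1])
        else:
            out |= all_writes(stmt[2]) | all_writes(stmt[3])
    return out

def find(body, pred):
    for stmt in body:
        if stmt[0] == "if":
            if stmt[1] == pred:
                return stmt
            hit = find(stmt[2], pred) or find(stmt[3], pred)
            if hit:
                return hit
    return None

def run(body, outcome):
    """Return (predicates evaluated, variables written) along one path."""
    seen, written = set(), set()
    for stmt in body:
        if stmt[0] == "write":
            written.add(stmt[1])
        else:
            seen.add(stmt[1])
            s, w = run(stmt[2] if outcome[stmt[1]] else stmt[3], outcome)
            seen |= s
            written |= w
    return seen, written

def may_w(body, pred, pot):
    """Variables in pot written at nodes control dependent on pred."""
    _, _, then_body, else_body = find(body, pred)
    return (all_writes(then_body) | all_writes(else_body)) & pot

def must_w(body, pred, pot):
    """Variables in pot written in every execution that includes pred."""
    names, result = predicates(body), None
    for values in product([True, False], repeat=len(names)):
        seen, written = run(body, dict(zip(names, values)))
        if pred in seen:
            result = written & pot if result is None else result & written
    return result if result is not None else set()

def reference_nondeterministic(body, state, pot):
    return any(not may_w(body, p, pot) <= must_w(body, p, pot)
               for p in predicates(body) if state[p] != "deterministic")

FIG_5_11 = [("if", "I", [("write", "X")], [("write", "X")]),
            ("if", "D", [("write", "X")], [("write", "Y")])]
POT_W = {"X", "Y"}
```

With I indeterminate and D deterministic the check passes; if D were instead assumed to be
indeterminate, the variable Y would be reported.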

Theorem 19 Every reference nondeterministic construct is detected by procedure
Check-Reference.


Proof. Suppose, for the sake of contradiction, some variable X with a reference
nondeterministic use in critical section A is not detected by Check-Reference. We assume that
the set of potentially access anomalous variables is correctly computed, and therefore, X
must be in Pot_R(A). If a use U of X occurs in some execution instances and not in others,
U must be control dependent on a set of nodes N = {n_1, ..., n_j}. There are several cases to
consider:

1. All nodes in N are deterministic. Whether or not U occurs is independent of the
critical section execution order.

2. Some node in N is not deterministic and is not in A. U cannot be in A since
Must_R(n) = ∅ for all n in A.

3. All nodes in N that are not deterministic are in A. If A contains coordination then
U cannot be in A since Must_R(n) = ∅ for all n in A. Otherwise, X ∈ Must_R(n)
for every n in N that is not deterministic. Therefore, whether or not X is used is
independent of the critical section execution order.

In all cases, a contradiction is reached. □

5.4     Heuristics for Nondeterministic Critical Sections

In order to guarantee that a program and input vector pair is anomaly free, an execution
instance must be generated and analyzed for each possible POEG. Unfortunately, this
task is generally intractable. For instance, 40,320 execution instances would have to be
analyzed dynamically in order to guarantee that a very modest parallel program with 8
concurrent critical sections does not contain any access anomalies.
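The figure of 40,320 is 8!, the number of orders in which 8 critical sections can execute.
A short sketch, assuming that each distinct order yields a distinct execution instance
(the function name is hypothetical):

```python
from itertools import permutations
from math import factorial

def execution_orders(n):
    """Number of sequential orders of n concurrent critical sections."""
    return factorial(n)

# Cross-check the closed form against explicit enumeration for a small case.
SMALL_CASE = len(list(permutations(range(4))))
```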

Because of the basic infeasibility of this goal, a dynamic anomaly detection system
generally will not attempt to guarantee anomaly freedom. Instead, its goal is to ensure
that false anomalies are not reported. In this case, alternative representations of
critical sections can be used that make dynamic anomaly detection more effective. Access
anomalies will not be "hidden" by coordination edges added to the POEG solely to reflect
nondeterminism, the number of execution instances is decreased, and the structure of the
POEG is simplified. Section 5.4.1 describes the approach of ignoring nondeterminism.

Alternatively, the effects of parallel, reference and sequential nondeterminism on a
given statically detected potential anomaly can be analyzed. By doing so, properties can
be guaranteed about certain classes of potential access anomalies. Section 5.4.2 describes
the localization technique.

5.4.1     Ignoring  Nondeterminism 

When the requirement of detecting anomaly freedom is relaxed, certain types of
nondeterminism can be ignored.

•  Whether or not a given parallel construct is sequential nondeterministic affects the
number of execution instances that must be modeled to guarantee anomaly freedom.
However, as is shown by Theorem 18, sequential nondeterminism does not influence
the reliability of the unordered representation to detect anomalies within a given
construct or cause false anomalies to be reported in subsequent code.

•  It may be valid to assume that reference nondeterminism is always unintentional.
There is no obvious use of reference nondeterminism that increases the power or
clarity of a parallel program. Given this assumption, parallel constructs that are
parallel deterministic but potentially reference nondeterministic can be analyzed
using an adaptive access anomaly representation. Initially, the unordered
representation is used. However, if false anomalies are reported, the ordered representation is
required to correctly model the reference nondeterministic behavior of the program.

Thus, it may be sufficient for a static analysis system to detect only parallel
nondeterminism, which can be done relatively efficiently, and ignore sequential and reference
nondeterminism.

In general, parallel nondeterminism cannot be simply ignored. As was illustrated by
many previous examples, the ordered representation is often needed to correctly capture
the semantics of useful critical section behavior. Fortunately, many important uses of
critical sections exhibit "restricted" parallel nondeterminism. Parallel nondeterminism
is restricted if the actions of a thread are not based on information about all of the
threads that entered the critical section before it; rather they are based on information
about a limited number of other threads. Knowledge about the semantics of parallel
nondeterministic critical section coordination can be used to add only the minimal amount
of ordering to the POEG, thereby exposing more access anomalies in any given execution
instance and reducing the number of execution instances.

The remainder of this section describes two important examples of restricted parallel
nondeterministic coordination: Fetch&φ and working set coordination. Table 5.1
summarizes the effects of parallel, reference, and sequential nondeterminism on access anomaly
detection in programs with critical section coordination.

Conditions                   Properties
---------------------------  ------------------------------------------------------------
Sequential Determinism       The unordered representation will not report false anomalies.
Reference Determinism        One execution instance is sufficient to prove anomaly freedom.
Parallel Determinism         There is no probe effect for detecting anomalies.

Sequential Nondeterminism    The unordered representation will not report false anomalies.
Reference Determinism        N! execution instances are required to prove anomaly freedom
Parallel Determinism         of the subsequent code.

Sequential Determinism       An adaptive representation is used as follows: the unordered
Reference Nondeterminism     representation is initially used; if false access anomalies
Parallel Determinism         are reported then the ordered representation is used.
                             N! execution instances are required to prove anomaly freedom
                             for the ordered representation.

Sequential Determinism       The unordered representation will not report false anomalies
Reference Determinism        if operations are uniform.
Fetch&φ Coordination         N! execution instances are required to prove anomaly freedom.

Sequential Determinism       The token based ordered representation can be used.
Reference Determinism        (N/2)! execution instances are required to prove anomaly
Work Set Coordination        freedom.

Sequential Determinism       The ordered representation is used.
Reference Determinism        N! execution instances are required to prove anomaly freedom.
Parallel Nondeterminism

Table 5.1: Summary of Heuristic Representations

Fetch&φ Coordination

Fetch&φ is a class of read-modify-write hardware instructions provided on some
architectures; most notably, the NYU Ultracomputer and IBM RP3 [Got88,PBG+85]*. Fetch&φ
indivisibly returns the current value of some variable X and then performs an associative,
binary operation φ using X and a passed parameter. If hardware support is not provided,
Fetch&φ can be implemented in software using critical sections as is shown in Figure 5.12.
Whether provided by hardware or simulated in software, a Fetch&φ operation is correctly
modeled in dynamic access anomaly detection systems as a critical section.

* Other multiple operation instructions, such as Test&Set, can be analyzed using the same
techniques presented below for modeling Fetch&φ.
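A software Fetch&φ built on a critical section can be sketched as follows. This is a
minimal illustration, assuming Python threads and one lock per shared variable; the
names SharedVar, fetch_and_phi, fetch_and_add and claim_all are hypothetical and do not
come from the thesis.

```python
import threading

class SharedVar:
    """A shared variable together with the lock that protects it."""
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()

def fetch_and_phi(x, phi, i):
    """Indivisibly return the old value of x and store phi(old value, i)."""
    with x.lock:                 # the critical section
        old = x.value
        x.value = phi(old, i)
    return old

def fetch_and_add(x, i):
    return fetch_and_phi(x, lambda a, b: a + b, i)

def claim_all(n_threads, n_items):
    """Threads repeatedly claim the next index until all items are claimed."""
    counter = SharedVar(1)
    claimed = [[] for _ in range(n_threads)]

    def worker(slot):
        r = fetch_and_add(counter, 1)
        while r <= n_items:
            claimed[slot].append(r)      # index r belongs to this thread only
            r = fetch_and_add(counter, 1)

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Which thread claims which index varies from run to run, but every index is claimed exactly
once, because the read and the update happen inside one critical section.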

Fetch&φ has proved to be very useful in writing parallel programs. To illustrate
the power of Fetch&φ, consider the program in Figure 5.13 that uses the self-service
paradigm [Got88]. In this paradigm, whenever a thread becomes idle, it assigns itself
work by modifying a global variable. The global variable can be efficiently maintained by
using Fetch&Add operations. In the program in Figure 5.13, each of the parallel iterates
performs Fetch&Add operations to assign itself a series of indices of array A to process.
Each iterate terminates when all 100 entries in A have been processed.

A set of Fetch&φ operations is uniform when all concurrent Fetch&φ operations
to the same shared variables are identical (e.g. the same operation φ and parameter),
such as in the program in Figure 5.13. There is no semantic ordering between threads
that successively perform uniform Fetch&φ operations, since no information about the
program state of threads that previously executed the critical section is known except
perhaps how many other threads have already performed a Fetch&φ operation*. Thus,
no false anomalies will be reported when the unordered representation is used to model
uniform Fetch&φ operations. The actions performed by a block can be dependent on the
value returned by a Fetch&φ operation, however, and hence N! execution instances are
still required to guarantee that the program does not contain any access anomalies.

* When exactly two concurrent threads perform Fetch&φ operations, knowing one other thread has
executed the critical section is equivalent to knowing the order of execution.

function Fetch&φ(X, i)
    lock(X.L)
    tmp := X.value
    X.value := φ(X.value, i)
    unlock(X.L)
    return tmp
end function


Figure 5.12: Fetch&φ


init(X, 1)
doall i := 1 to n
    r := Fetch&Add(X, 1)
    while r ≤ 100 do
        A[r] := f(r)
        r := Fetch&Add(X, 1)
    end while
end all


Figure 5.13: Self-service work assignment with Fetch&Add

Work Set Coordination

In work set coordination, tokens are associated with shared variables that control access to
those variables. A block b can access a shared variable only if it has the token associated
with that variable. Once b obtains the token from the work set, it can access the data
associated with that token without introducing any anomalies. When b finishes accessing
a shared variable, it frees the associated token by adding it to the work set. This type of
coordination may be exhibited in programs that use, for example, mailboxes, queues, or
stacks or the Linda programming model [GCCC85]*.

* We do not address the issue of detecting working set coordination, a problem posed by Emrath and
Padua [EP88].

To illustrate work set coordination, consider the program fragment in Figure 5.14.
In this program a shared queue stores information about the entries in an array that
have finished the initialization stage and are ready for the computation phase. After an
iterate initializes an entry A[j], it enqueues the index j. When an iterate needs an entry
to process, it dequeues the index of an initialized entry. The accesses performed in the
initialization of an entry i never overlap with the processing of i because of the implicit
coordination through the index i.

parallel case
    doall j := 1 to 3
        initialize A[j]
        lock(L)
        enqueue(j)
        unlock(L)
    endall
parallel
    doall j := 1 to 3
        lock(L)
        dequeue(k)
        unlock(L)
        process A[k]
    endall
end parallel case

Figure 5.14: Work Set coordination example

Correct access to the work set pool is obtained by enclosing all operations on it within
critical sections. If the unordered critical section representation is used to model these
critical sections, false anomalies can be reported. Anomalies are correctly detected when
the ordered critical section representation is used; however, too much ordering is imposed,
since the edges needed to represent implicit ordering from work set coordination are a
subset of those in the ordered representation. A correct and efficient representation of
work set coordination only orders blocks that exchange a token.

A work set implementation of critical sections requires that the dynamic anomaly
detection system knows which critical sections add tokens to the pool and which extract
tokens; lock covers must be used to model accesses to the pool. When a block B adds a
token T to the pool, the block identifier of B is associated with T. When a block D deletes
token T from the pool, a coordination edge is added to the POEG from B (the block whose
identifier is associated with T) to D (the deleting block). Thus, only those blocks that
have exchanged some state information that could be used in subsequent computations
are ordered in the POEG. This ordering is generally of much finer granularity than the
ordered critical section representation.
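
The bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class name WorkSetMonitor, its method names, and the block and token labels are all hypothetical, and the POEG is reduced to a plain list of coordination edges.

```python
class WorkSetMonitor:
    """Hypothetical monitor for a work set pool (all names are illustrative)."""

    def __init__(self):
        self.adder_of = {}   # token -> identifier of the block that added it
        self.edges = []      # coordination edges (adding block, deleting block)

    def add_token(self, block, token):
        # Associate the adding block's identifier with the token.
        self.adder_of[token] = block

    def delete_token(self, block, token):
        # Order the deleting block after the block that added the token.
        adder = self.adder_of.pop(token)
        self.edges.append((adder, block))


monitor = WorkSetMonitor()
monitor.add_token("B1", "T1")
monitor.add_token("B2", "T2")
monitor.delete_token("D1", "T2")   # D1 happens to receive the token B2 added
monitor.delete_token("D2", "T1")
print(monitor.edges)               # [('B2', 'D1'), ('B1', 'D2')]
```

Only the pairs of blocks that actually exchanged a token are ordered; no edge relates B1 to D1 or B2 to D2.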

The benefits of properly modeling work set coordination are similar to those for
properly modeling any coordination primitive. First, each block is ordered with only a minimal
set of other blocks. Therefore, the probability of detecting an access anomaly in any given
execution instance is much higher. Second, the number of execution instances is greatly
reduced. In general, there are (N/2)! rather than N! different execution instances to
consider.

The advantage of explicitly representing work set coordination can be seen by
considering an execution instance of the program in Figure 5.14. Figure 5.15 shows the POEG


[Diagram not reproduced: the recoverable node labels are the initialize-and-enqueue
blocks for entries 1 and 2 and a dequeue-and-process block.]

Figure 5.15: POEG for Work Set Coordination

for the execution instance in which iterates i1, i2 and i3 respectively add tokens 1, 2 and 3
to the pool, and iterate i4 deletes token 1, iterate i5 deletes token 2, and iterate i6 deletes
token 3. The edge from i2 to i5 indicates that a token that i2 added to the work set was
removed by i5.
First, the work set representation adds three coordination edges, resulting in each
thread being ordered with one additional block. The ordered representation would add
five coordination edges, so that each thread is ordered with five additional blocks. Second,
by using the work set representation, only 6 execution instances must be analyzed to
guarantee that the program in Figure 5.14 is anomaly free. In contrast, 720 execution
instances would have to be considered if the ordered representation were used.
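
The two counts can be checked with a short calculation. The sketch below assumes that N counts the critical section executions in the example (six: three that add a token and three that delete one), so that the ordered representation admits N! orderings while the work set representation only distinguishes the (N/2)! ways of pairing adding blocks with deleting blocks. The function names are illustrative.

```python
from math import factorial


def ordered_instances(n):
    # Ordered representation: every ordering of the n critical sections.
    return factorial(n)


def work_set_instances(n):
    # Work set representation: pairings of n/2 adders with n/2 deleters.
    return factorial(n // 2)


n = 6  # critical section executions in the Figure 5.14 example (assumed)
print(work_set_instances(n), ordered_instances(n))  # 6 720
```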

5.4.2     Localizing  Nondeterminism 

In general, static access anomaly detection is used to identify a set of potential access
anomalies, and dynamic detection is then used to determine if these are actual access
anomalies. Even if a parallel construct is nondeterministic, a given potential access
anomaly in that construct can be independent of the execution order of its critical section.
We say that an access anomaly is invariant if and only if neither of the conflicting
accesses is parallel, sequential or reference nondeterministic^. This property is determined
by producing a forward slice from every nondeterministic use and definition found using
the algorithm described in Section 5.3.

For example, consider the parallel nondeterministic program in Figure 5.16. Although


X := 0
doall i := 1 to 2
    Y := A[i]
    lock(L)
    X := X + 1
    j := X
    unlock(L)
    A[j] := Y × x
endall


Figure 5.16: Local effect of nondeterminism

this program is parallel nondeterministic, the potential access anomalies to Y are invariant
and will occur regardless of the critical section execution order. The potential access
anomalies to array A are not invariant, since the write of A is control dependent on j
which is parallel nondeterministic. The access anomaly to A[2] will only be reported in
those execution instances in which iterate 2 enters the critical section before iterate 1.
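
The distinction can be made concrete by enumerating both critical section orders for the program in Figure 5.16. The sketch below is only an illustration under simplifying assumptions: each iterate is reduced to the set of locations it reads and writes outside the critical section, X is ignored because it is accessed only under the lock, the final operand of the last assignment is omitted, and any write in one iterate that touches a location read or written by the other is counted as an anomaly. The function and variable names are hypothetical.

```python
from itertools import permutations


def accesses(order):
    """Read and write sets of iterates 1 and 2 for one critical section order."""
    x = 0
    j = {}
    for i in order:            # iterates enter the critical section in this order
        x += 1                 # X := X + 1
        j[i] = x               # j := X
    result = {}
    for i in (1, 2):
        reads = {f"A[{i}]"}                # Y := A[i]
        writes = {"Y", f"A[{j[i]}]"}       # Y := ... and A[j] := ...
        result[i] = (reads, writes)
    return result


def anomalies(order):
    """Locations written by one iterate and read or written by the other."""
    acc = accesses(order)
    (r1, w1), (r2, w2) = acc[1], acc[2]
    return (w1 & (r2 | w2)) | (w2 & r1)


reported = {order: anomalies(order) for order in permutations((1, 2))}
invariant = set.intersection(*reported.values())

for order, found in reported.items():
    print(order, sorted(found))
print("reported in every order:", sorted(invariant))
```

Under these assumptions the anomaly on Y appears in both orders, while the anomalies on A[1] and A[2] appear only when iterate 2 enters the critical section first.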

Access anomalies that are invariant can be reliably detected even in the presence of
parallel, reference and sequential nondeterminism. In particular, in the absence of other


^This is similar to the notion of static anomalies proposed by Allen and Padua for the hides relationship
[AP87]. Because this work focuses on the nondeterminism stemming from critical section coordination, a
more accurate classification is possible.
access anomalies, one execution instance is sufficient to guarantee that a POEG
deterministic program with critical sections and input vector pair is invariant anomaly free.
(This follows from the fact that in the absence of other access anomalies, every sequential,
parallel and reference deterministic use or definition occurs in every execution instance.)
Therefore, if every potential access anomaly in P is invariant, then one execution instance
is sufficient to guarantee that P and I is anomaly free.
Chapter 6


Conclusions 


This thesis addresses the issue of detecting a specific type of nondeterminism in shared
memory parallel programs known as an access anomaly. An access anomaly occurs when
an update to a shared variable X is concurrent with either a read of X or another update
of X. Access anomalies are often bugs. Even when intentional, an access anomaly
introduces nondeterminism which makes understanding and debugging parallel programs very
difficult. Thus, it is essential that access anomalies are isolated, not necessarily so that
they can be eliminated, but rather so that they are made explicit and their effect on program
behavior can be considered. Unless access anomalies are detected and reported, we can
have little confidence in the reliability of parallel programs.
The primary contributions of this thesis are as follows:

1.  Task  Recycling  Technique 

A new on-the-fly access anomaly detection algorithm referred to as task recycling
is presented in Chapter 3. Task recycling is a significant improvement over existing
techniques in several ways. First, it supports a wide class of parallel programs. All
previous techniques support only a subset of common parallel programming models.
For instance, none supported the Ada tasking model.

Second, task recycling is more efficient for many programs. It minimizes the cost per
variable access (since this generally is assumed to be the most common operation),
and requires an amount of space that is a function of the maximum parallelism in
the program, rather than the total length of the execution.

In addition, we show how the task recycling technique can be integrated with trace-
and-replay and event-based debugging systems to improve their reliability.

2.  Empirical  Measurements 

While several on-the-fly access anomaly detection algorithms have been proposed,
none had been implemented and little was known about the actual costs of detecting


anomalies in "real" parallel programs. To this end, Chapter 4 presents empirical
measurements gathered from monitoring several scientific programs using the task
recycling technique and an alternative technique, known as English-Hebrew labeling.

We show that program behavior that is common in parallel scientific codes allows for
inexpensive detection of access anomalies. The actual overheads that are incurred
(less than 350% slowdown for the benchmark programs) indicate that on-the-fly
anomaly detection is efficient enough to be a viable debugging tool. Moreover, task
recycling compares favorably with English-Hebrew labeling, both in terms
of time and space.

3. Critical Section Nondeterminism

Chapter 5 presents a new technique for representing critical section coordination
that greatly increases the effectiveness of dynamic anomaly detection. This
representation can be used when the execution order of concurrent critical sections
does not affect the execution of the rest of the program. (For instance, consider
a program in which the critical sections are used to compute a maximum.) We
present algorithms, based on static compiler optimization techniques, for detecting
several different types of nondeterminism that can arise from critical section
coordination. The absence of these types of nondeterminism determines when the
new representation of critical sections is reliable.

The techniques used to analyze critical section coordination give insight into similar
algorithms for other coordination primitives: for example, the Ada rendezvous. A
better understanding of the behavior of critical sections can also lead to more efficient
trace-and-replay debugging and testing of parallel programs.